Summary: | Regression: Severe memory leak when running Second Life | ||
---|---|---|---|
Product: | Mesa | Reporter: | Sean McNamara <gm.potato.ul> |
Component: | Drivers/Gallium/r600 | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED FIXED | QA Contact: | |
Severity: | major | ||
Priority: | medium | CC: | sa, smcnam |
Version: | git | ||
Hardware: | Other | ||
OS: | Linux (All) | ||
Whiteboard: | |||
Attachments: | proposed patch |
Description
Sean McNamara
2011-05-13 03:40:12 UTC
(In reply to comment #0)
> Mesa 7.10.2 tarball OR Mesa 7.10 branch from git: Correct rendering; no
> crashing.

Can you bisect mesa and see what commits between 7.10 and master broke it?

(In reply to comment #1)
> (In reply to comment #0)
> > Mesa 7.10.2 tarball OR Mesa 7.10 branch from git: Correct rendering; no
> > crashing.
>
> Can you bisect mesa and see what commits between 7.10 and master broke it?

debc45bca07a5dfad4199079f080b35c19f00e85 starts the memory leaks (very noticeable on this commit, just watch gnome-system-monitor's total memory usage grow by 25 - 100 MB/s)

I got two kernel hard-locks with 8c631cfeae29b5236928f759e222aa35e6e4984c also, but it doesn't seem to have the memory leak.

Anyway, I'm not sure *exactly* which commit because I was fighting hardlocks and memory leaks at the same time, and sometimes they are intermittent, but if you look at the r600g commit series by Marek on or around January 28 - 29, I think the problem lies somewhere in there. That's my best guess, and it's certainly more relevant than what I came up with the last time I bisected, which was the merge of dricore support (and I'm not even using dricore, btw).

Followup: I also get the following messages spewed to dmesg every single frame, regardless if I'm using Mesa 7.10.2 or git master or anything in between. I'm not sure if this is related to the problem or just noise. But I suspect that it's unrelated, because I don't experience any symptoms of failure (crash, OOM) with 7.10.2, and the messages still get spewed. The frequency is about once every frame, or a bit more often. Maybe as frequently as once per kernel tick (~1000 Hz timer).

[ 564.159042] DRHD: handling fault status reg 2
[ 564.159248] DMAR:[DMA Write] Request device [04:00.0] fault addr 44bc0000
[ 564.159249] DMAR:[fault reason 05] PTE Write access is not set

The DRHD line appears less frequently, but the DMAR lines are always grouped together like that.
I tried running other GL apps and it doesn't happen there.

(In reply to comment #3)
> Followup: I also get the following messages spewed to dmesg every single frame,
> regardless if I'm using Mesa 7.10.2 or git master or anything in between. I'm
> not sure if this is related to the problem or just noise. But I suspect that
> it's unrelated, because I don't experience any symptoms of failure (crash, OOM)
> with 7.10.2, and the messages still get spewed. The frequency is about once
> every frame, or a bit more often. Maybe as frequently as once per kernel tick
> (~1000 Hz timer).
>
> [ 564.159042] DRHD: handling fault status reg 2
> [ 564.159248] DMAR:[DMA Write] Request device [04:00.0] fault addr 44bc0000
> [ 564.159249] DMAR:[fault reason 05] PTE Write access is not set
>
> The DRHD line appears less frequently, but the DMAR lines are always grouped
> together like that. I tried running other GL apps and it doesn't happen there.

The DMAR and DRHD lines appear to be related to my ASUS P6T Deluxe's broken BIOS. Even the latest version of this BIOS is unable to handle the Core i7's VT-d feature correctly. On boot-up, the kernel complains very loudly about my broken BIOS and the incorrect functioning of the chipset IOMMU; when I disable VT-d, all of this complaining goes away, as do these DMAR / DRHD faults.

Unfortunately, when disabling VT-d, I still have the memory leak.

I tried to reproduce this on my 5670, but couldn't. I don't get a memory leak or a hard lock, instead the viewer (tried both SL and Imprudence) segfaults after a short while. This is with current git master, with 7.10 everything seems to be working. I will try to bisect and file a new bug in case it's not the same problem.

Created attachment 47812
proposed patch

I'm having the same memory leak issue (though I kill the game when swapping starts so I did not experience kernel lock).

The attached patch seems to fix the problem; could you please test it with your setup?

(In reply to comment #6)
> Created an attachment (id=47812) [details]
> proposed patch
>
> I'm having the same memory leak issue (though I kill the game when swapping
> starts so I did not experience kernel lock).
>
> The attached patch seems to fix the problem; could you please test it with your
> setup ?

Your patch fixes the problem here too! Great job spotting it!!

I am curious, what research process did you undertake in order to isolate the leak? I would like to learn how to do that. I am a software engineer, but apparently there's some tool I should know about that would help me narrow down this kind of leak to a specific part of the code.

Anyway, I think you should really put this patch up for inclusion into git master. I don't see any adverse effects from the patch, and it definitely resolves the issue (I've run imprudence for 45 minutes and it's only using 3.8% of my 6GB of memory, whereas before it would get OOM Killed within 5 minutes).

(In reply to comment #7)
> Your patch fixes the problem here too! Great job spotting it!!

Did the patch also solve the kernel hang? If not, it should be easier to track down now.

(In reply to comment #6)
> Created an attachment (id=47812) [details]
> proposed patch
>
> I'm having the same memory leak issue (though I kill the game when swapping
> starts so I did not experience kernel lock).
>
> The attached patch seems to fix the problem; could you please test it with your
> setup ?

Also, I believe I fixed my kernel hardlocks by disabling my motherboard chipset's buggy IOMMU. The IOMMU for Intel VT-d on the ASUS P6T Deluxe motherboard is so buggy that it makes VT-d completely unusable on this motherboard. ASUS seems unwilling to patch their broken BIOS, maybe because the hardware is just broken. Anyway, I can't reproduce the kernel hardlock with Linux 3.0-rc2 and the IOMMU disabled in the BIOS (VT-d set to disabled).

(In reply to comment #6)
> Created an attachment (id=47812) [details]
> proposed patch
>
> I'm having the same memory leak issue (though I kill the game when swapping
> starts so I did not experience kernel lock).
>
> The attached patch seems to fix the problem; could you please test it with your
> setup ?

The patch also fixes the segfaults I experienced. Same bug, different behaviour I guess.

(In reply to comment #10)
> (In reply to comment #6)
> > Created an attachment (id=47812) [details]
> > proposed patch
> >
> > I'm having the same memory leak issue (though I kill the game when swapping
> > starts so I did not experience kernel lock).
> >
> > The attached patch seems to fix the problem; could you please test it with your
> > setup ?
>
> The patch also fixes the segfaults I experienced. Same bug, different behaviour
> I guess.

Hmm. I never got it to segfault, IOMMU or not :) The closest I could get to a segfault is the kernel sending it SIGKILL, which results in the console saying "Killed" and dmesg warning you that the oomkiller had to kill the worst RAM offender.

Anyway, we should really mark this bug as fixed, even though the original bug title is a *horrible* misnomer for the actual symptom and fix here...

This should not be marked as RESOLVED before the fix lands in git.

(In reply to comment #7)
> Your patch fixes the problem here too! Great job spotting it!!

cool :)

> I am curious, what research process did you undertake in order to isolate the
> leak?

I've mainly used valgrind (massif and memcheck tools) to see where the memory allocations came from (massif) and whether some of them were really memory leaks (i.e. memory allocated with no reference left to it). Those 2 tools pointed to the same place, so I started digging there to understand how this could be possible, and eventually found a place where a refcounted resource was mis-released. This part was done mainly with printf logging and cgdb.
Thanks for testing, I'll propose the patch for inclusion.
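
For readers wondering what a "mis-released refcounted resource" can look like, here is a minimal sketch in C. It is not the attached patch (whose contents are not shown in this thread) and it does not use the real r600g structures: `struct resource`, `resource_reference`, `bind_leaky` and `run_frames` are hypothetical names invented for illustration. The assumption is a binding slot that holds one reference; the buggy variant overwrites the slot without dropping the reference on the previous resource, so every rebind strands one allocation.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical refcounted resource; NOT the actual r600g data structure. */
struct resource {
    int refcount;
};

/* Number of resources currently allocated and not yet freed (demo bookkeeping). */
static int live_resources = 0;

static struct resource *resource_create(void)
{
    struct resource *res = malloc(sizeof(*res));
    if (!res)
        return NULL;
    res->refcount = 1;          /* the creator holds the first reference */
    live_resources++;
    return res;
}

/* Correct binding: take a reference on the new resource and drop the
 * reference the slot held on the old one, freeing it at zero. */
static void resource_reference(struct resource **slot, struct resource *res)
{
    struct resource *old = *slot;

    if (res)
        res->refcount++;
    if (old) {
        assert(old->refcount > 0);
        if (--old->refcount == 0) {
            free(old);
            live_resources--;
        }
    }
    *slot = res;
}

/* Buggy binding: the slot is overwritten directly, so the reference it
 * held on the previous resource is never dropped. */
static void bind_leaky(struct resource **slot, struct resource *res)
{
    if (res)
        res->refcount++;
    *slot = res;
}

/* Bind n freshly created resources to one slot, one per "frame", then
 * unbind. Returns how many resources are still allocated afterwards. */
static int run_frames(int n, int leaky)
{
    struct resource *slot = NULL;
    int before = live_resources;

    for (int i = 0; i < n; i++) {
        struct resource *res = resource_create();
        if (!res)
            break;
        if (leaky)
            bind_leaky(&slot, res);
        else
            resource_reference(&slot, res);

        /* The creator drops its own reference; the slot keeps the resource alive. */
        struct resource *tmp = res;
        resource_reference(&tmp, NULL);
    }

    resource_reference(&slot, NULL);    /* unbind the last resource */
    return live_resources - before;
}
```

With the correct helper, `run_frames(100, 0)` leaves nothing allocated. With the buggy one, `run_frames(100, 1)` strands 99 resources, one per rebind. If this is called from a small `main` and run under `valgrind --leak-check=full`, memcheck should flag the stranded blocks as lost, which is the kind of signal described in the comment above.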