Summary: [i965] "glresize" causes memory leak
Product: DRI
Component: DRM/Intel
Status: CLOSED FIXED
Severity: normal
Priority: medium
Version: unspecified
Hardware: Other
OS: All
Reporter: Nick Bowler <nbowler>
Assignee: Jesse Barnes <jbarnes>
CC: brian, chris

Description
Nick Bowler
2010-05-12 19:46:33 UTC
Whoops, forgot to mention that I'm on a T500 laptop with a GM45.

Hi Nick, I haven't yet reproduced this one. It looks like we hit an error during preparing the page flip and continued on regardless, resulting in a dereference of an invalid buffer later. Is there anything else you can think of that might be a factor?

Created attachment 35649 [details]
Kernel configuration.
Here's my kernel config, in case it helps.
I also discovered that booting with "nosmp" allows me to observe the same crash consistently: I never have to run the program more than once and none of the *ERROR* lines get printed. The first line to appear on netconsole is the BUG: one in this case (every time).

(In reply to comment #2)
> Is there anything else you can think of that might be a factor?

A bit of idle speculation: it used to "feel" like it was caused by memory pressure (the program used to cause the OOM killer to start killing things seemingly at random, which is why I had to disable overcommit to trigger the panic before). It still does, a bit, since the program runs for a few seconds first. Thinking of that, I realize that the system has no swap (2G of RAM), which may be relevant? (I'm not able to test this theory right now).

Created attachment 35653 [details] [review]
Clear unpin work under spinlock along error path.

I looked through my list of outstanding patches and to my surprise this is still not upstream... It's the only explanation I've found for an unhandled error during pageflip.

Created attachment 35660 [details]
Full kernel log.
OK, I got around to testing the latest stuff. First, I updated each of linux,
libdrm and xf86-video-intel to the latest git and tried to reproduce the crash:
* The OOM killer problem is back, so I once again need to disable vm
overcommit to get this. One step at a time, though...
* Sure enough, the panic occurs just like it did before.
Next, I built a kernel with the patch from this bug applied. Unfortunately, it
did not solve the issue. However, it seems to have made the "nosmp" case the
same as normal: I sometimes need to run the program multiple times to cause a
panic.
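
For readers following along: the patch tested above is titled "Clear unpin work under spinlock along error path", but its contents are not quoted in this thread. The following is only a userspace sketch of the general idea that title suggests, not the actual i915 change. The names `fake_crtc`, `flip_work`, `queue_flip` and `finish_flip` are hypothetical, and a pthread mutex stands in for the kernel spinlock.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are NOT the real i915 structures. */
struct flip_work {
    int pending;                  /* set only once the flip was really queued */
};

struct fake_crtc {
    pthread_mutex_t lock;         /* plays the role of the spinlock */
    struct flip_work *unpin_work; /* shared with the completion side */
};

/* Completion side: touches unpin_work only while holding the lock. */
static int finish_flip(struct fake_crtc *crtc)
{
    struct flip_work *work;

    assert(crtc != NULL);
    pthread_mutex_lock(&crtc->lock);
    work = crtc->unpin_work;
    if (work == NULL || !work->pending) {
        pthread_mutex_unlock(&crtc->lock);
        return 0;                 /* nothing to complete */
    }
    crtc->unpin_work = NULL;
    pthread_mutex_unlock(&crtc->lock);
    free(work);
    return 1;
}

/* Submission side; simulate_failure fakes an error such as -ENOMEM. */
static int queue_flip(struct fake_crtc *crtc, int simulate_failure)
{
    struct flip_work *work;

    assert(crtc != NULL);
    work = calloc(1, sizeof(*work));
    if (work == NULL)
        return -ENOMEM;

    pthread_mutex_lock(&crtc->lock);
    if (crtc->unpin_work != NULL) {
        pthread_mutex_unlock(&crtc->lock);
        free(work);
        return -EBUSY;            /* a flip is already outstanding */
    }
    crtc->unpin_work = work;      /* publish, not yet pending */
    pthread_mutex_unlock(&crtc->lock);

    if (simulate_failure) {
        /* Error path: unpublish under the same lock, then free. */
        pthread_mutex_lock(&crtc->lock);
        crtc->unpin_work = NULL;
        pthread_mutex_unlock(&crtc->lock);
        free(work);
        return -ENOMEM;
    }

    pthread_mutex_lock(&crtc->lock);
    work->pending = 1;            /* completion side may now consume it */
    pthread_mutex_unlock(&crtc->lock);
    return 0;
}
```

In this sketch, a failed `queue_flip()` leaves `unpin_work` as NULL, so a later `finish_flip()` finds nothing to complete instead of following a stale pointer. Build with `-pthread` if your toolchain needs it.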
In case it's interesting, I have attached the complete kernel log, booted with
"nosmp".

I'm still not sure where the bo leak is coming from. I don't see that behaviour on i945, but it did come close on my gm45 by using over 1GiB in objects before the cache was reaped, and climbed rapidly back to 1GiB, ad infinitum. So there could be some behaviour on i965 that I'm not understanding yet.

But I don't really want to fix that whilst it is producing such a nice reproducible test case for a kernel panic. ;-)

Back to hunting for an ENOMEM leading to a NULL dereference...

Nick, it would speed the search immensely if you could extract the line number for the address. I'm sure google knows how, I cheat and use builtin modules and addr2line.

(In reply to comment #7)
> I'm still not sure where the bo leak is coming from. I don't see that
> behaviour on i945, but it did come close on my gm45 by using over 1GiB in
> objects before the cache was reaped, and climbed rapidly back to 1GiB, ad
> infinitum. So there could be some behaviour on i965 that I'm not
> understanding yet.

I also observed this behaviour in the first post of bug 27922, before writing "glresize". I assumed it no longer occurred because glresize adjusts the window size at a significantly faster rate than my keyboard repeat allows.

> But I don't really want to fix that whilst it is producing such a nice
> reproducible test case for a kernel panic. ;-)

Sounds completely reasonable to me.

> Back to hunting for an ENOMEM leading to a NULL dereference...
>
> Nick, it would speed the search immensely if you could extract the line
> number for the address. I'm sure google knows how, I cheat and use builtin
> modules and addr2line.

Since I have the i915 driver built-in, I rebuilt the kernel with debug symbols, crashed it again and used addr2line on the resulting address. The culprit appears to be:

drivers/gpu/drm/i915/intel_display.c:4159

which is the line

obj_priv = to_intel_bo(work->pending_flip_obj);

Created attachment 35668 [details] [review]
Remove the DRM_DEBUG, it's too dangerous

Sigh, that's quite a tame bug. No excuse to hit the Ramos.

Yup, you've got it. I backed out the earlier patch to test: the second alone is enough and the kernel no longer panics! Seems like the only problem left with "glresize" is the bo leak problem.

Eric has pushed the kernel patch into drm-intel-next, so that should hopefully make it into 2.6.35 and stable.

Hmm, bo leak... I've just tried again on a swapless g45 with 1GiB of memory, and I'm not seeing the same behaviour you are. Here, the bo seem to be allocated until we fill the aperture and then the cache is reaped, so there does not appear to be a leak. This is using current ddx and 2.6.34 + anholt/drm-intel-next and a few trivial error cleanup patches.

I haven't actually tried it for a couple of weeks. After 2.6.35-rc1 is out, I'll re-test and file a new bug if it's still reproducible.

Hmm, checked on the machine this morning to find the OOM-killer had struck down X. So whilst there may not be a bo leak (as reported by /sys/kernel/debug/dri/0/gem_objects), something is leaking...

Retitling to emphasize the remaining bug on g45.

(In reply to comment #14)
> Hmm, checked on the machine this morning to find the OOM-killer had struck down
> X. So whilst there may not be a bo leak (as reported by
> /sys/kernel/debug/dri/0/gem_objects), something is leaking...

This is just a bug in the test case which is leaking events.

Nick, everything seems to be behaving itself now, I think. Thanks for the bug report, and do please keep filing them!
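
For anyone wanting to watch the object counts mentioned in this thread, here is a minimal sketch of a helper that copies /sys/kernel/debug/dri/0/gem_objects to a stream. The path is the one quoted in the comments; the helper name `dump_file` is hypothetical, and the sketch assumes debugfs is mounted and readable (typically root only). It makes no assumption about the file's format and simply passes the bytes through.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Copy the contents of `path` to `out`.
 * Returns the number of bytes copied, or -1 if the file cannot be opened
 * (for example when debugfs is not mounted or permissions are missing).
 */
static long dump_file(const char *path, FILE *out)
{
    char buf[4096];
    size_t n;
    long total = 0;
    FILE *in;

    assert(path != NULL && out != NULL);

    in = fopen(path, "r");
    if (in == NULL)
        return -1;

    while ((n = fread(buf, 1, sizeof(buf), in)) > 0) {
        fwrite(buf, 1, n, out);
        total += (long)n;
    }

    fclose(in);
    return total;
}
```

Calling `dump_file("/sys/kernel/debug/dri/0/gem_objects", stdout)` at intervals while the test program runs would show whether the counts keep climbing or drop back when the cache is reaped.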