Bug 28079 - [i965] "glresize" causes memory leak
Summary: [i965] "glresize" causes memory leak
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Jesse Barnes
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-05-12 19:46 UTC by Nick Bowler
Modified: 2017-07-24 23:08 UTC
CC List: 2 users

See Also:
i915 platform:
i915 features:


Attachments
netconsole output (2.71 KB, text/plain)
2010-05-12 19:46 UTC, Nick Bowler
Kernel configuration. (57.24 KB, text/plain)
2010-05-14 04:44 UTC, Nick Bowler
Clear unpin work under spinlock along error path. (2.11 KB, patch)
2010-05-14 11:30 UTC, Chris Wilson
Full kernel log. (54.88 KB, text/plain)
2010-05-14 17:22 UTC, Nick Bowler
Remove the DRM_DEBUG, it's too dangerous (1.74 KB, patch)
2010-05-15 01:55 UTC, Chris Wilson

Description Nick Bowler 2010-05-12 19:46:33 UTC
Created attachment 35603
netconsole output

This was originally described in bug 27922 and occurs when running the "glresize" test program there, which creates an OpenGL window and then rapidly resizes it, switching back and forth between filling the entire screen and its original size.

After running glresize enough times (it often segfaults a few times first), the kernel panics.  I used to have to disable vm_overcommit to get this behaviour, but with latest git libdrm and xf86-video-intel, it happens regardless of the overcommit setting.

Attached is the netconsole log output.  Looks like the kernel proceeded to deadlock while handling the panic, as the output got truncated and the system ceased to respond to sysrq.

This is with Linus' latest git kernel.
Comment 1 Nick Bowler 2010-05-12 19:46:59 UTC
Whoops, forgot to mention that I'm on a T500 laptop with a GM45.
Comment 2 Chris Wilson 2010-05-14 02:42:16 UTC
Hi Nick, I haven't yet reproduced this one. It looks like we hit an error while preparing the page flip and continued on regardless, resulting in a dereference of an invalid buffer later.

Is there anything else you can think of that might be a factor?
Comment 3 Nick Bowler 2010-05-14 04:44:51 UTC
Created attachment 35649
Kernel configuration.

Here's my kernel config, in case it helps.

I also discovered that booting with "nosmp" allows me to observe the same crash consistently: I never have to run the program more than once and none of the *ERROR* lines get printed.  The first line to appear on netconsole is the BUG: one in this case (every time).
Comment 4 Nick Bowler 2010-05-14 07:23:02 UTC
(In reply to comment #2)
> Is there anything else you can think of that might be a factor?

A bit of idle speculation: it used to "feel" like it was caused by memory pressure (the program used to cause the OOM killer to start killing things seemingly at random, which is why I had to disable overcommit to trigger the panic before).  It still does, a bit, since the program runs for a few seconds first.

Thinking of that, I realize that the system has no swap (2G of RAM), which may be relevant? (I'm not able to test this theory right now).
Comment 5 Chris Wilson 2010-05-14 11:30:40 UTC
Created attachment 35653
Clear unpin work under spinlock along error path.

I looked through my list of outstanding patches and to my surprise this is still not upstream... It's the only explanation I've found for an unhandled error during pageflip.
Comment 6 Nick Bowler 2010-05-14 17:22:06 UTC
Created attachment 35660
Full kernel log.

OK, I got around to testing the latest stuff.  First, I updated each of linux,
libdrm and xf86-video-intel to the latest git and tried to reproduce the crash:

  * The OOM killer problem is back, so I once again need to disable vm
    overcommit to get this.  One step at a time, though...
  * Sure enough, the panic occurs just like it did before.

Next, I built a kernel with the patch from this bug applied.  Unfortunately, it
did not solve the issue.  However, it seems to have made the "nosmp" case the
same as normal: I sometimes need to run the program multiple times to cause a
panic.

In case it's interesting, I have attached the complete kernel log, booted with
"nosmp".
Comment 7 Chris Wilson 2010-05-14 17:32:57 UTC
I'm still not sure where the bo leak is coming from. I don't see that behaviour on i945, but it did come close on my gm45 by using over 1GiB in objects before the cache was reaped, and climbed rapidly back to 1GiB, ad infinitum. So there could be some behaviour on i965 that I'm not understanding yet.

But I don't really want to fix that whilst it is producing such a nice reproducible test case for a kernel panic. ;-)

Back to hunting for an ENOMEM leading to a NULL dereference...

Nick, it would speed the search immensely if you could extract the line number for the address. I'm sure Google knows how; I cheat and use builtin modules and addr2line.
Comment 8 Nick Bowler 2010-05-14 18:27:17 UTC
(In reply to comment #7)
> I'm still not sure where the bo leak is coming from. I don't see that
> behaviour on i945, but it did come close on my gm45 by using over 1GiB in
> objects before the cache was reaped, and climbed rapidly back to 1GiB, ad
> infinitum. So there could be some behaviour on i965 that I'm not
> understanding yet.

I also observed this behaviour in the first post of bug 27922, before writing
"glresize".  I assumed it no longer occurred because glresize adjusts the
window size at a significantly faster rate than my keyboard repeat allows.

> But I don't really want to fix that whilst it is producing such a nice
> reproducible test case for a kernel panic. ;-)

Sounds completely reasonable to me.

> Back to hunting for an ENOMEM leading to a NULL dereference...
> 
> Nick, it would speed the search immensely if you could extract the line
> number for the address. I'm sure google knows how, I cheat and use builtin
> modules and addr2line.

Since I have the i915 driver built-in, I rebuilt the kernel with debug symbols,
crashed it again and used addr2line on the resulting address.  The culprit
appears to be:

  drivers/gpu/drm/i915/intel_display.c:4159

which is the line

  obj_priv = to_intel_bo(work->pending_flip_obj);
Comment 9 Chris Wilson 2010-05-15 01:55:13 UTC
Created attachment 35668
Remove the DRM_DEBUG, it's too dangerous

Sigh, that's quite a tame bug. No excuse to hit the Ramos.
Comment 10 Nick Bowler 2010-05-15 06:43:42 UTC
Yup, you've got it.  I backed out the earlier patch to test: the second alone is enough and the kernel no longer panics!
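
For readers following the diagnosis, the hazard can be sketched in user-space C. This is an illustration only, not the i915 code: `struct buffer_obj`, `struct flip_work` and `describe_flip` are hypothetical stand-ins, and the sketch assumes the failure was a debug message reading through an object pointer that an earlier error path had left NULL. The patch attached above removed the message outright; the guarded version here is shown only to make the unsafe pattern visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the driver's structures; not the real i915 types. */
struct buffer_obj {
    int pending_flip;                     /* flips still outstanding on this object */
};

struct flip_work {
    struct buffer_obj *pending_flip_obj;  /* may be NULL after a failed setup */
    int pending;
};

/*
 * Format a debug line about an unexpected flip completion.
 * The hazardous pattern would read work->pending_flip_obj->pending_flip
 * unconditionally; this version checks each pointer before using it.
 * Returns 0 if the object could be described, -1 if a pointer was missing.
 */
static int describe_flip(const struct flip_work *work, char *buf, size_t len)
{
    assert(buf != NULL && len > 0);

    if (work == NULL) {
        snprintf(buf, len, "flip finish: no work queued");
        return -1;
    }
    if (work->pending_flip_obj == NULL) {
        snprintf(buf, len, "flip finish: work has no object");
        return -1;
    }

    snprintf(buf, len, "flip finish: object has %d pending flip(s), work %s",
             work->pending_flip_obj->pending_flip,
             work->pending ? "pending" : "not pending");
    return 0;
}
```

A caller would pass a scratch buffer, for example `char msg[128]; describe_flip(work, msg, sizeof msg);`, and can log `msg` whatever the return value is.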

Seems like the only problem left with "glresize" is the bo leak problem.
Comment 11 Chris Wilson 2010-05-27 05:28:50 UTC
Eric has pushed the kernel patch into drm-intel-next, so that should hopefully make it into 2.6.35 and stable.

Hmm, bo leak...
Comment 12 Chris Wilson 2010-05-27 09:19:09 UTC
I've just tried again on a swapless g45 with 1GiB of memory, and I'm not seeing the same behaviour you are. Here, the bos seem to be allocated until we fill the aperture and then the cache is reaped, so there does not appear to be a leak.

This is using current ddx and 2.6.34 + anholt/drm-intel-next and a few trivial error cleanup patches.
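
The difference between "grows until the cache is reaped" and "leaks" can be made concrete with a toy model. Everything below is an assumption for illustration: `struct cache_model` and `model_alloc` are hypothetical names, sizes are abstract units, and the real GEM buffer cache is considerably more involved. The point is only that usage following this pattern is a sawtooth bounded by the limit, whereas a leak would keep climbing.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: abstract size units, no relation to the real GEM allocator. */
struct cache_model {
    size_t limit;     /* stand-in for the aperture size */
    size_t cached;    /* memory currently held by cached buffers */
    size_t peak;      /* highest value `cached` has reached */
    unsigned reaps;   /* how many times the cache was emptied */
};

/* Account for one new buffer; reap the cache first if it would overflow. */
static void model_alloc(struct cache_model *m, size_t size)
{
    assert(m != NULL && size <= m->limit);

    if (m->cached + size > m->limit) {
        m->cached = 0;          /* reap: drop every cached buffer */
        m->reaps++;
    }
    m->cached += size;
    if (m->cached > m->peak)
        m->peak = m->cached;
}
```

Starting from `struct cache_model m = { 1024, 0, 0, 0 };` and calling `model_alloc(&m, 8)` repeatedly, `m.peak` never exceeds `m.limit` however long the loop runs, which matches the non-leaking behaviour described above.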
Comment 13 Nick Bowler 2010-05-27 09:26:20 UTC
I haven't actually tried it for a couple weeks.  After 2.6.35-rc1 is out, I'll re-test and file a new bug if it's still reproducible.
Comment 14 Chris Wilson 2010-05-28 01:20:25 UTC
Hmm, checked on the machine this morning to find the OOM-killer had struck down X. So whilst there may not be a bo leak (as reported by /sys/kernel/debug/dri/0/gem_objects), something is leaking...
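
A minimal sketch for capturing that debugfs file from a program, assuming debugfs is mounted at /sys/kernel/debug and the process may read it (this usually requires root). `dump_text_file` is a hypothetical helper name, and the contents of `gem_objects` depend on the kernel version, so the sketch copies the text through without parsing it.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Copy a text file (for example a debugfs entry) to `out`.
 * Returns the number of bytes copied, or -1 if the file cannot be opened
 * (debugfs not mounted, insufficient privileges, or no such DRM minor).
 */
static long dump_text_file(const char *path, FILE *out)
{
    char chunk[4096];
    size_t n;
    long total = 0;
    FILE *in;

    assert(path != NULL && out != NULL);

    in = fopen(path, "r");
    if (in == NULL)
        return -1;

    while ((n = fread(chunk, 1, sizeof chunk, in)) > 0) {
        fwrite(chunk, 1, n, out);
        total += (long)n;
    }
    fclose(in);
    return total;
}
```

Calling `dump_text_file("/sys/kernel/debug/dri/0/gem_objects", stdout)` before and after a glresize run gives two snapshots to compare; a return value of -1 means the file could not be opened.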
Comment 15 Chris Wilson 2010-05-28 05:26:39 UTC
Retitling to emphasize the remaining bug on g45.
Comment 16 Chris Wilson 2010-05-28 07:31:19 UTC
(In reply to comment #14)
> Hmm, checked on the machine this morning to find the OOM-killer had struck down
> X. So whilst there may not be a bo leak (as reported by
> /sys/kernel/debug/dri/0/gem_objects), something is leaking...

This is just a bug in the test case which is leaking events. Nick, everything seems to be behaving itself now, I think. Thanks for the bug report, and do please keep filing them!

