Bugzilla – Bug 29857
[915GM] fb pin leak, OOPS with 2.6.36-rc2
Last modified: 2011-02-01 01:31:22 UTC
Description of the problem:
Ever since 2.6.36rc2 (more explicitly git commit 9d0498a2bf7455159b317f19531a3e5db2ecc9c4 "drm/i915: wait for actual vblank, not just 20ms") I've been seeing display misdetection and generally strange behaviour on my EeePC 900 (915GM). This can range from tearing while watching video, flickering when xrandr is run, a disappearing cursor in X and the screen not being redrawn correctly in X, to increased warnings in dmesg, the display being too large when X starts and hitting BUG_ON messages.
Steps to reproduce:
1. Start the EeePC with its battery inside, or:
1. Start the EeePC without a battery inside, go to runlevel 1/single and then do a suspend to RAM using
echo mem > /sys/power/state
and then resume.
2. Start X
3. Run totem and watch a video.
Expected results:
No refresh problems or incorrect resolution detection.
Actual results:
Intermittent screen redraws and incorrect resolution detection.
How reproducible is the problem:
The problem is quite reproducible with the above steps.
The problems do not occur in 2.6.35.
On 2.6.36rc2+ the modetest command runs quickly when the system is going to be stable and no displays are found, slowly when fake displays are found and there are going to be problems. On 2.6.35 and below modetest always runs slowly.
Running the modetest command on a console shows the strange misdetected displays too.
The following BUG_ON can be reproduced from the console by running
while true; do ./modetest; done
and periodically running
echo mem > /sys/power/state
at the same time:
-----------[ cut here ]------------
kernel BUG at drivers/gpu/drm/i915/i915_gem.c:4025!
invalid opcode: 0000 [#1]
last sysfs file: /sys/devices/pci0000:00/0000:00:1c.2/0000:01:00.0/ieee80211/phy0/rfkill0/uevent
Pid: 1665, comm: modetest Not tainted 2.6.36-rc2-00237-gd4348c6 #74 900/900
EIP: 0060:[<b02cc502>] EFLAGS: 00010246 CPU: 0
EIP is at i915_gem_object_pin+0x162/0x1a0
EAX: ef11a140 EBX: ef11a140 ECX: 0003c47e EDX: 00010000
ESI: ef066c00 EDI: 0003c000 EBP: 00071180 ESP: de0c5b6c
DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Process modetest (pid: 1665, ti=de0c4000 task=de0aca40 task.ti=de0c4000)
00000000 00000000 de0aca40 b04a297b ef0cbb80 0000000f ef11a140 00000001
<0> ef066c00 00071180 b02d6da2 00071180 b04a2c39 ef15f000 00000001 b02d6ec4
<0> b0125aec b01141f3 b03ba7b6 000000f0 66ce91a6 b04a417e 66ce91a6 00020f6b
[<b04a297b>] ? schedule+0x12b/0x330
[<b02d6da2>] ? intel_pin_and_fence_fb_obj+0x52/0xb0
[<b04a2c39>] ? _cond_resched+0x29/0x40
[<b02d6ec4>] ? intel_pipe_set_base+0xc4/0x490
[<b0125aec>] ? irq_exit+0x2c/0x70
[<b01141f3>] ? smp_apic_timer_interrupt+0x43/0x80
[<b03ba7b6>] ? raw_pci_read+0x46/0x90
[<b04a417e>] ? apic_timer_interrupt+0x2a/0x30
[<b02d8263>] ? intel_crtc_mode_set+0xb03/0x18e0
[<b0129ea2>] ? run_timer_softirq+0x12/0x1d0
[<b02a87bf>] ? drm_crtc_helper_set_mode+0x29f/0x3c0
[<b02a96b4>] ? drm_crtc_helper_set_config+0x834/0x8d0
[<b02a64bf>] ? drm_fb_helper_force_kernel_mode+0x4f/0x90
[<b02a6508>] ? drm_fb_helper_restore+0x8/0x30
[<b02ae061>] ? drm_lastclose+0x41/0x2d0
[<b02ae8d0>] ? drm_release+0x460/0x5f0
[<b0177dfe>] ? fput+0xbe/0x1f0
[<b017516e>] ? filp_close+0x3e/0x70
[<b0175204>] ? sys_close+0x64/0x90
[<b01029d0>] ? sysenter_do_call+0x12/0x26
Code: 76 00 8d bc 27 00 00 00 00 bd ff 0f 00 00 ba 00 10 00 00 e9 44 ff ff ff 90 89 d8 e8 99 f1 ff ff 85 c0 0f 84 61 ff ff ff 90 eb 8a <0f> 0b eb fe 89 54 24 10 89 54 24 14 89 44 24 0c c7 44 24 08 50
EIP: [<b02cc502>] i915_gem_object_pin+0x162/0x1a0 SS:ESP 0068:de0c5b6c
---[ end trace fa3d3ff06dc95228 ]---
Created attachment 38242 [details]
modetest output when no battery is in the laptop
Created attachment 38243 [details]
2.6.36rc2 modetest output when a battery is in the laptop
Created attachment 38244 [details]
2.6.35 modetest output (with or without a battery in the laptop)
Created attachment 38245 [details]
2.6.34 (and earlier kernels) modetest output (with or without a battery in the laptop)
Created attachment 38246 [details] [review]
Can you try this debug patch? I've converted the BUG_ON to a WARN, so the system should continue to function (I've also increased the size of pin_count to prevent problems). As soon as the first few WARN_ONs have hit, please attach the complete dmesg.
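For readers following along, the idea behind the debug patch can be sketched in userspace. This is only a toy model and not the patch itself: `toy_obj`, `toy_pin`, `toy_unpin`, `toy_leak_cycles` and the limit `TOY_MAX_PIN_COUNT` are hypothetical names and values invented for illustration. It shows a pin counter that warns and carries on when a limit is reached, rather than halting, so a leak can be watched as it grows.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical limit chosen for illustration only. */
#define TOY_MAX_PIN_COUNT 15u

struct toy_obj {
    unsigned int pin_count; /* outstanding pins, kept in a full-width counter */
    unsigned int warnings;  /* how many times the limit check has fired */
};

/* WARN-style pin: report when the limit has been reached, then carry on. */
static void toy_pin(struct toy_obj *obj)
{
    assert(obj != NULL);
    if (obj->pin_count >= TOY_MAX_PIN_COUNT) {
        obj->warnings++;
        fprintf(stderr, "WARN: pin_count %u has reached the limit\n",
                obj->pin_count);
    }
    obj->pin_count++;
}

/* Drop one pin, complaining instead of underflowing. */
static void toy_unpin(struct toy_obj *obj)
{
    assert(obj != NULL);
    if (obj->pin_count == 0) {
        fprintf(stderr, "WARN: no pin to drop\n");
        return;
    }
    obj->pin_count--;
}

/* Each cycle pins twice but unpins once, so one pin leaks per cycle. */
unsigned int toy_leak_cycles(struct toy_obj *obj, unsigned int cycles)
{
    for (unsigned int i = 0; i < cycles; i++) {
        toy_pin(obj);
        toy_pin(obj);
        toy_unpin(obj);
    }
    return obj->pin_count;
}
```

Calling `toy_leak_cycles()` on a zero-initialised `struct toy_obj` leaves one extra pin per cycle, and the warnings start once the count reaches the assumed limit. With a hard assertion in place of the warning, the first hit would stop the run and hide how fast the count climbs.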
Created attachment 38247 [details]
dmesg produced with debugging patch
Things got a bit weird after enough suspend/resume cycles. The screen eventually flickered white and stayed black while the system continued to respond. I've chopped off the head of the dmesg and started from the drm messages.
Created attachment 38248 [details] [review]
new debug patch
Use this one instead of the old one. Hopefully this spits out something interesting.
Created attachment 38249 [details]
dmesg from tester with the new debug patch
Two things to note:
13 times "no obj to unpin": once on boot-up is all right, the other 12 are surplus.
pincount at the end = 14. Subtracting 12 yields 2, which looks like the correct value (one pin for the kernel fb console, one pin because the fb console is the current scanout buffer).
The other hilarity is how often set_base gets called with the dev->mode_config.mutex ...
Created attachment 38250 [details]
dmesg of initial boot with drm.debug=0xe
Created attachment 38256 [details]
dmesg of initial boot with drm.debug=0xe where the pincount rises to 3
This boot was done with init=/bin/bash to double check that there wasn't any chance of anything else being run.
Created attachment 38257 [details]
dmesg of boot followed by suspend/resume with drm.debug=0xe (pincount rises quickly)
This log actually stops at pincount 6 and is gzip'd due to its size.
Created attachment 38557 [details] [review]
Only decouple fb when calling mode_set*()
Created attachment 38559 [details] [review]
Drop fb pin on DPMS_ON
Just testing the theory. We need to rewrite our prepare/commit to not use DPMS (or enable/disable) so that we can move the pin/unpin into enable/disable.
I've pushed a revised pair of patches in -staging.
Created attachment 38567 [details]
Oops in intel_crtc_disable
After reverting 300387c0b57d75e5218e2881d6ad2720657a8bcf to make the issue easier to reproduce, things blew up with drm-intel-staging after a number of suspend/resume cycles, giving the attached oops.
As the fb pin leak appears to have been exacerbated by the intel_wait_for_vblank() regression, I'm not planning to push for a fix in 2.6.36. Obviously, I will re-assess its priority if we hit the same fb-pincount-leak BUG in -fixes.
Created attachment 38589 [details]
dmesg produced of -staging with debugging patch
I had to manually fix up the debugging patch to apply to -staging but I think the change was trivial so I hopefully did it correctly.
OK, I have just been retesting -staging, as I am not sure I was using it for my previous comment. With commits b7ffdc988523fb57ac1ef454b77d6ecc01dda4d3 (drm: Use a nondestructive mode for output detect when polling) and 300387c0b57d75e5218e2881d6ad2720657a8bcf (drm/i915: Clear the vblank status bit before polling for the next vblank) in place I can't reproduce the issue because modetest no longer returns quickly. Without these, I get the oops mentioned in comment #15.
Sitsofe, thank you for clarifying that; I was going to ask you later. :)
Concerning the priority, have you seen the fb-pin OOPS just with -fixes?
(In reply to comment #19)
> Concerning the priority, have you seen the fb-pin OOPS just with -fixes?
Sorry, I only just realised you asked me a question! Unless I revert 300387c0b57d75e5218e2881d6ad2720657a8bcf (drm/i915: Clear the vblank status bit before polling for the next vblank) I can't reproduce this issue in -fixes (I did 30 suspend/resume cycles after echoing devices into /sys/power/pm_test while running modetest in a loop). When modetest doesn't return quickly I think this issue is going to be incredibly hard to trigger.
Hmm, something more subtle is happening that I can't quite get a handle on. Do only disconnects cause the unpin leak, and if so, why? OK, this is definitely -next material.
See also #29325.
That should have been bug 29230
*** Bug 32776 has been marked as a duplicate of this bug. ***
Created attachment 42634 [details] [review]
Don't switch fb after a no-op
Author: Chris Wilson <firstname.lastname@example.org>
Date: Fri Jan 28 11:53:03 2011 +0000
drm: Don't switch fb when disabling an output
In drm_crtc_helper_set_config, we call drm_crtc_helper_set_mode which
may return early and do no operation if the crtc is to be disabled. In
this case we merrily swap to the new fb, discarding the old_fb believing
that it has been cleaned up. However, due to the early return, the
old_fb was not presented to the backend for correct reaping, and nor was
the new one - which is about to be reaped via the
drm_helper_disable_unused_functions(), leading to incorrect refcounting
of the pinned objects.
Tested-by: Takashi Iwai <email@example.com>
Signed-off-by: Chris Wilson <firstname.lastname@example.org>
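The refcounting imbalance described in the commit message can be illustrated with a small userspace model. This is a sketch under stated assumptions, not the DRM helper code: `toy_fb`, `toy_crtc`, `toy_set_mode`, `toy_disable_unused`, `toy_set_config` and `toy_scenario` are hypothetical stand-ins, and the `swap_when_disabling` flag exists only to switch between the behaviour the message describes as broken and a variant that leaves the fb alone when the crtc is being disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the real DRM types or functions. */
struct toy_fb {
    int pin_count;          /* signed so that an imbalance shows up as < 0 */
};

struct toy_crtc {
    struct toy_fb *fb;      /* framebuffer currently attached to the crtc */
    bool enabled;
};

/* Backend mode set: a no-op with an early return if the crtc is disabled. */
static void toy_set_mode(struct toy_crtc *crtc, struct toy_fb *old_fb)
{
    if (!crtc->enabled)
        return;             /* neither fb is presented to the backend */
    if (crtc->fb)
        crtc->fb->pin_count++;  /* pin the new scanout buffer */
    if (old_fb)
        old_fb->pin_count--;    /* release the previous one */
}

/* Reap whatever fb is still attached to a disabled crtc. */
static void toy_disable_unused(struct toy_crtc *crtc)
{
    if (!crtc->enabled && crtc->fb) {
        crtc->fb->pin_count--;
        crtc->fb = NULL;
    }
}

/* swap_when_disabling == true models the behaviour described as buggy. */
static void toy_set_config(struct toy_crtc *crtc, struct toy_fb *new_fb,
                           bool enable, bool swap_when_disabling)
{
    struct toy_fb *old_fb;

    assert(crtc != NULL);
    old_fb = crtc->fb;
    crtc->enabled = enable;
    if (enable || swap_when_disabling)
        crtc->fb = new_fb;  /* the swap that should not follow a no-op */
    toy_set_mode(crtc, old_fb);
    toy_disable_unused(crtc);
}

/*
 * Enable a crtc on fb "a", then disable it while passing fb "b".
 * Returns a's final pin count (0 means balanced) and stores b's in *b_pins.
 */
int toy_scenario(bool swap_when_disabling, int *b_pins)
{
    struct toy_fb a = { 0 }, b = { 0 };
    struct toy_crtc crtc = { NULL, false };

    toy_set_config(&crtc, &a, true, swap_when_disabling);
    toy_set_config(&crtc, &b, false, swap_when_disabling);
    if (b_pins)
        *b_pins = b.pin_count;
    return a.pin_count;
}
```

In this model, swapping on the disable path leaves the old fb pinned forever and unpins a new fb that was never pinned. That pairs a leaked pin with a surplus unpin, which at least resembles the "no obj to unpin" messages seen earlier in this report.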