Bug 29857

Summary:

[915GM] fb pin leak, OOPS with 2.6.36-rc2

Product:

DRI

Reporter:

Sitsofe Wheeler <sitsofe>

Component:

DRM/Intel

Assignee:

Chris Wilson <chris>

Status:

CLOSED FIXED

QA Contact:

Severity:

normal

Priority:

medium

CC:

daniel, florian, tfjellstrom

Version:

XOrg git

Hardware:

Other

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
modetest output when no battery is in the laptop	none
2.6.36rc2 modetest output when a battery is in the laptop	none
2.6.35 modetest output (with or without a battery in the laptop)	none
2.6.34 (and earlier kernels) modetest output (with or without a battery in the laptop)	none
debug patch	none
dmesg produced with debugging patch	none
new debug patch	none
dmesg from tester with the new debug patch	none
dmesg of initial boot with drm.debug=0xe	none
dmesg of initial boot with drm.debug=0xe where the pincount rises to 3	none
dmesg of boot followed by suspend/resume with drm.debug=0xe (pincount rises quickly)	none
Only decouple fb when calling mode_set*()	none
Drop fb pin on DPMS_ON	none
Oops in intel_crtc_disable	none
dmesg produced of -staging with debugging patch	none
Don't switch fb after a no-op	none

Description Sitsofe Wheeler 2010-08-28 10:18:17 UTC

Description of the problem:
Ever since 2.6.32rc2 (more explicitly git commit 9d0498a2bf7455159b317f19531a3e5db2ecc9c4 "drm/i915: wait for actual vblank, not just 20ms") I've been seeing display misdection and generally strange behaviour on my EeePC 900 (915GM). This can range from tearing while watching video, flickering when xrandr is run, disappearing cursor in X, the screen not being redrawn correctly in X, increased warnings in dmesg, the display being too large when X starts and hitting BUG_ON messages.

Steps to reproduce:
1. Start EeePC with its battery inside.
or
1. Start EeePC without a battery inside, go to runlevel 1/single and then do a suspend to ram using
echo mem > /sys/power/state
and then resume.
2. Start X
3. Run totem and watch a video.

Expected results:
No refresh or incorrect resolution detection.

Actual results:
Intermittent screen redraws and incorrect resolution detection.

How reproducible is the problem:
The problem is quite reproducible with the above steps.

Version information:
Ubuntu 10.04
Kernel d4348c678977c7093438bbbf2067c49396ae941b

Additional information:
The problems do not occur in 2.6.35.

On 2.6.36rc2+ the modetest command runs quickly when the system is going to be stable and no displays are found, slowly when fake displays are found and there are going to be problems. On 2.6.35 and below modetest always runs slowly.

Using the modetest command a console shows the strange misdetected displays too.
The following BUG_ON can be reproduced from the console by running
while true; do ./modetest; done
and periodically running
echo mem > /sys/power/state
at the same time:

-----------[ cut here ]------------
kernel BUG at drivers/gpu/drm/i915/i915_gem.c:4025!
invalid opcode: 0000 [#1] 
last sysfs file: /sys/devices/pci0000:00/0000:00:1c.2/0000:01:00.0/ieee80211/phy0/rfkill0/uevent

Pid: 1665, comm: modetest Not tainted 2.6.36-rc2-00237-gd4348c6 #74 900/900
EIP: 0060:[<b02cc502>] EFLAGS: 00010246 CPU: 0
EIP is at i915_gem_object_pin+0x162/0x1a0
EAX: ef11a140 EBX: ef11a140 ECX: 0003c47e EDX: 00010000
ESI: ef066c00 EDI: 0003c000 EBP: 00071180 ESP: de0c5b6c
 DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
Process modetest (pid: 1665, ti=de0c4000 task=de0aca40 task.ti=de0c4000)
Stack:
 00000000 00000000 de0aca40 b04a297b ef0cbb80 0000000f ef11a140 00000001
<0> ef066c00 00071180 b02d6da2 00071180 b04a2c39 ef15f000 00000001 b02d6ec4
<0> b0125aec b01141f3 b03ba7b6 000000f0 66ce91a6 b04a417e 66ce91a6 00020f6b
Call Trace:
 [<b04a297b>] ? schedule+0x12b/0x330
 [<b02d6da2>] ? intel_pin_and_fence_fb_obj+0x52/0xb0
 [<b04a2c39>] ? _cond_resched+0x29/0x40
 [<b02d6ec4>] ? intel_pipe_set_base+0xc4/0x490
 [<b0125aec>] ? irq_exit+0x2c/0x70
 [<b01141f3>] ? smp_apic_timer_interrupt+0x43/0x80
 [<b03ba7b6>] ? raw_pci_read+0x46/0x90
 [<b04a417e>] ? apic_timer_interrupt+0x2a/0x30
 [<b02d8263>] ? intel_crtc_mode_set+0xb03/0x18e0
 [<b0129ea2>] ? run_timer_softirq+0x12/0x1d0
 [<b02a87bf>] ? drm_crtc_helper_set_mode+0x29f/0x3c0
 [<b02a96b4>] ? drm_crtc_helper_set_config+0x834/0x8d0
 [<b02a64bf>] ? drm_fb_helper_force_kernel_mode+0x4f/0x90
 [<b02a6508>] ? drm_fb_helper_restore+0x8/0x30
 [<b02ae061>] ? drm_lastclose+0x41/0x2d0
 [<b02ae8d0>] ? drm_release+0x460/0x5f0
 [<b0177dfe>] ? fput+0xbe/0x1f0
 [<b017516e>] ? filp_close+0x3e/0x70
 [<b0175204>] ? sys_close+0x64/0x90
 [<b01029d0>] ? sysenter_do_call+0x12/0x26
Code: 76 00 8d bc 27 00 00 00 00 bd ff 0f 00 00 ba 00 10 00 00 e9 44 ff ff ff 90 89 d8 e8 99 f1 ff ff 85 c0 0f 84 61 ff ff ff 90 eb 8a <0f> 0b eb fe 89 54 24 10 89 54 24 14 89 44 24 0c c7 44 24 08 50 
EIP: [<b02cc502>] i915_gem_object_pin+0x162/0x1a0 SS:ESP 0068:de0c5b6c
---[ end trace fa3d3ff06dc95228 ]---

Comment 1 Sitsofe Wheeler 2010-08-28 10:19:41 UTC

Created attachment 38242 [details]
modetest output when no battery is in the laptop

Comment 2 Sitsofe Wheeler 2010-08-28 10:20:19 UTC

Created attachment 38243 [details]
2.6.36rc2 modetest output when a battery is in the laptop

Comment 3 Sitsofe Wheeler 2010-08-28 10:21:11 UTC

Created attachment 38244 [details]
2.6.35 modetest output (with or without a battery in the laptop)

Comment 4 Sitsofe Wheeler 2010-08-28 10:24:25 UTC

Created attachment 38245 [details]
2.6.34 (and earlier kernels) modetest output (with or without a battery in the laptop)

Comment 5 Daniel Vetter 2010-08-28 10:29:23 UTC

Created attachment 38246 [details] [review]
debug patch

Can you try this debug patch. I've converted the BUG_ON to a WARN, so the system should continue to function (also increased the size of pin_count to prevent problems). As soon as the first few WARN_ONs have hit, please attache the complete dmesg.

Comment 6 Sitsofe Wheeler 2010-08-28 11:16:25 UTC

Created attachment 38247 [details]
dmesg produced with debugging patch

Things got a bit weird after enough suspend/resume cycles. The screen eventually flickered white and stayed black while the system continued to respond. I've chopped off the head of the dmesg and started from the drm messages.

Comment 7 Daniel Vetter 2010-08-28 12:46:31 UTC

Created attachment 38248 [details] [review]
new debug patch

Use this one instead of the old one. Hopefully this spit out something interesting.

Comment 8 Daniel Vetter 2010-08-28 13:14:56 UTC

Created attachment 38249 [details]
dmesg from tester with the new debug patch

A two things to note:

13 times "no obj to unpin": Once on boot-up is allright, the other 12 are surplus.
pincount at the end = 14. Subtracting 12 yields 2, which looks like the correct value (one pin for the kernel fb console, one pin because the fb console is the current scanout buffer).

The other hilarity is how often set_base gets called with the dev->mode_config.mutex ...

Comment 9 Sitsofe Wheeler 2010-08-28 15:16:43 UTC

Created attachment 38250 [details]
dmesg of initial boot with drm.debug=0xe

Comment 10 Sitsofe Wheeler 2010-08-29 02:21:56 UTC

Created attachment 38256 [details]
dmesg of initial boot with drm.debug=0xe where the pincount rises to 3

This boot was done with init=/bin/bash to double check that there wasn't any chance of anything else being run.

Comment 11 Sitsofe Wheeler 2010-08-29 03:53:54 UTC

Created attachment 38257 [details]
dmesg of boot followed by suspend/resume with drm.debug=0xe (pincount rises quickly)

This log actually stops at pincount 6 and is gzip'd due to its size.

Comment 12 Chris Wilson 2010-09-08 08:32:57 UTC

Created attachment 38557 [details] [review]
Only decouple fb when calling mode_set*()

Comment 13 Chris Wilson 2010-09-08 08:34:22 UTC

Created attachment 38559 [details] [review]
Drop fb pin on DPMS_ON

Just testing the theory, we need to rewrite our prepare/commit to not use DPMS (or enable/disable) so that we can move the pin/unpin into enable/disable.

Comment 14 Chris Wilson 2010-09-08 08:48:31 UTC

I've pushed a revised pair of patches in -staging.

Comment 15 Sitsofe Wheeler 2010-09-08 15:40:00 UTC

Created attachment 38567 [details]
Oops in intel_crtc_disable

After reverting 300387c0b57d75e5218e2881d6ad2720657a8bcf to make the issue easier to reproduce thing blew up with drm-intel-staging after a number of suspend/resume cycles with the attached oops.

Comment 16 Chris Wilson 2010-09-09 04:07:13 UTC

As the fb pin leak appears to have been exacerbated by the intel_wait_for_vblank() regression, I'm not planning to push for a fix in 2.6.36. Obviously, I will re-assess its priority if we hit the same fb-pincount-leak BUG in -fixes.

Comment 17 Sitsofe Wheeler 2010-09-09 14:37:27 UTC

Created attachment 38589 [details]
dmesg produced of -staging with debugging patch

I had to manually fix up the debugging patch to apply to -staging but I think the change was trivial so I hopefully did it correctly.

Comment 18 Sitsofe Wheeler 2010-09-10 01:30:46 UTC

OK I have just been retesting -staging as I am not sure I was using it with my previous comment. With commits b7ffdc988523fb57ac1ef454b77d6ecc01dda4d3 (drm: Use a nondestructive mode for output detect when polling) and 300387c0b57d75e5218e2881d6ad2720657a8bcf (drm/i915: Clear the vblank status bit before polling for the next vblank) in place I can't reproduce the issue because modetest no longer returns quickly. Without these, I get the oops mentioned in comment #15.

Comment 19 Chris Wilson 2010-09-10 01:38:47 UTC

Sitsofe thankyou for clarifying that, I was going to ask you later. :)

Concerning the priority, have you seen the fb-pin OOPS just with -fixes?

Comment 20 Sitsofe Wheeler 2010-09-12 04:15:31 UTC

(In reply to comment #19)
> 
> Concerning the priority, have you seen the fb-pin OOPS just with -fixes?

Sorry, only just realised you asked me a question! Unless I revert 300387c0b57d75e5218e2881d6ad2720657a8bcf (drm/i915: Clear the vblank status bit before polling for the next vblank) I can't reproduce this issue in -fixes (I did 30 suspend/resume cycles after echoing devices in /sys/power/pm_test while running modetest in a loop). When modetest doesn't return quickly I think this issue is going to be incredibly hard to trigger.

Comment 21 Chris Wilson 2010-09-12 04:24:50 UTC

Hmm, something more subtle is happening that I can't quite get a handle on - do only disconnects cause the upin leak, why? Ok, this is definitely -next material.

See also #29325.

Comment 22 Chris Wilson 2010-09-12 04:26:29 UTC

That should have been bug 29230

Comment 23 Chris Wilson 2011-01-02 05:18:05 UTC

*** Bug 32776 has been marked as a duplicate of this bug. ***

Comment 24 Chris Wilson 2011-01-28 04:02:42 UTC

Created attachment 42634 [details] [review]
Don't switch fb after a no-op

Comment 25 Chris Wilson 2011-02-01 01:31:22 UTC

commit 9334ef755f060e251f3f395caeda1a58b6834ea3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jan 28 11:53:03 2011 +0000

    drm: Don't switch fb when disabling an output
    
    In drm_crtc_helper_set_config, we call drm_crtc_helper_set_mode which
    may return early and do no operation if the crtc is to be disabled. In
    this case we merrily swap to the new fb, discarding the old_fb believing
    that it has been cleaned up. However, due to the early return, the
    old_fb was not presented to the backend for correct reaping, and nor was
    the new one - which is about to be reaped via the
    drm_helper_disable_unused_functions(), leading to incorrect refcounting
    of the pinned objects.
    
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=27722
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29857
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=29230
    Tested-by: Takashi Iwai <tiwai@suse.de>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.