Bug 65554

Summary:

CPU lock with nouveau_fan_update

Product:

xorg

Reporter:

ddamienn

Component:

Driver/nouveau

Assignee:

Nouveau Project <nouveau>

Status:

RESOLVED INVALID

QA Contact:

Xorg Project Team <xorg-team>

Severity:

major

Priority:

medium

CC:

ddamienn

Version:

unspecified

Hardware:

Other

OS:

Linux (All)

Attachments:

Description	Flags
kernel log	none

Description ddamienn 2013-06-09 00:20:39 UTC

Created attachment 80540
kernel log

Kernel logs report "Watchdog detected hard LOCKUP on cpu 1". It has occurred once per day for the last three days. Each time, it happens seemingly at random: the monitor is turned off and I'm connected remotely via SSH, but the system is idle.

This is on ArchLinux with xf86-video-nouveau 1.0.7. The machine is ~2.5 years old and has been very solid until now. My video card is: 01:00.0 VGA compatible controller: NVIDIA Corporation GT216 [GeForce GT 220] (rev a2).

I have attached the kernel log. I also don't know why the 'PTHERM' events would be occurring, since the connected monitor was off and I was only connected remotely via SSH.

Comment 1 Ilia Mirkin 2013-06-09 00:40:47 UTC

nouveau_fan_update:
	spin_lock_irqsave(&fan->lock, flags);
	/* schedule next fan update, if not at target speed already */
	if (list_empty(&fan->alarm.head) && target != duty) {
		u16 bump_period = fan->bios.bump_period;
		u16 slow_down_period = fan->bios.slow_down_period;
...
		ptimer->alarm(ptimer, delay * 1000 * 1000, &fan->alarm);

If delay is somehow 0, the ->alarm will cause nouveau_fan_update to get called immediately. Can you add a printk to that function that shows the values? (This may end up totally flooding your dmesg too... but I think the values may be the same across prints.)

e.g.

diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/fan.c b/drivers/gpu/drm/nouveau/core/subdev/therm/fan.c
index c728380..9453afd 100644
--- a/drivers/gpu/drm/nouveau/core/subdev/therm/fan.c
+++ b/drivers/gpu/drm/nouveau/core/subdev/therm/fan.c
@@ -88,7 +88,7 @@ nouveau_fan_update(struct nouveau_fan *fan, bool immediate, int target)
                        delay = min(bump_period, slow_down_period) ;
                else
                        delay = bump_period;
-
+               nv_info(therm, "Scheduling fan update in %d (slow: %d, bump: %d)\n", delay, slow_down_period, bump_period);
                ptimer->alarm(ptimer, delay * 1000 * 1000, &fan->alarm);
        }
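
To illustrate the concern above, here is a minimal userspace sketch in C. It is not the driver's code: `toy_fan`, `toy_alarm`, `toy_fan_update` and `count_immediate_updates` are hypothetical names, the choice between the two periods is reduced to always using the bump period, and the `stuck` flag is an assumption standing in for a fan whose duty never reaches the target. The only behaviour taken from the comment is that a zero delay makes the alarm call the update function immediately.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fan state; NOT struct nouveau_fan. */
struct toy_fan {
    int duty;             /* current duty cycle, percent */
    int target;           /* requested duty cycle, percent */
    uint16_t bump_period; /* delay between updates, in ms */
    bool stuck;           /* assumption: a stuck fan never changes duty */
    unsigned calls;       /* number of times toy_fan_update() has run */
    unsigned max_calls;   /* safety cap so the demo always terminates */
};

static void toy_fan_update(struct toy_fan *fan);

/*
 * Toy timer. Assumption from the comment above: a zero delay fires the
 * callback right away. A non-zero delay would fire later, which this
 * model does not simulate. Returns true if the alarm fired synchronously.
 */
static bool toy_alarm(struct toy_fan *fan, uint64_t delay_ns)
{
    if (delay_ns != 0)
        return false;
    if (fan->calls < fan->max_calls)
        toy_fan_update(fan);
    return true;
}

static void toy_fan_update(struct toy_fan *fan)
{
    fan->calls++;

    /* Simplification: a working fan reaches the target in one step. */
    if (!fan->stuck)
        fan->duty = fan->target;

    /* Schedule the next update if not at the target yet. */
    if (fan->target != fan->duty) {
        uint64_t delay = fan->bump_period;
        toy_alarm(fan, delay * 1000 * 1000); /* ms -> ns */
    }
}

/* Count how many updates run back to back with no time passing. */
unsigned count_immediate_updates(uint16_t bump_period_ms, bool stuck,
                                 unsigned max_calls)
{
    struct toy_fan fan = {
        .duty = 30, .target = 60,
        .bump_period = bump_period_ms,
        .stuck = stuck, .calls = 0, .max_calls = max_calls,
    };

    assert(max_calls > 0);
    toy_fan_update(&fan);
    return fan.calls;
}
```

In this model, `count_immediate_updates(0, true, 1000)` runs into the cap of 1000 calls, while a non-zero period such as `count_immediate_updates(500, true, 1000)` yields a single call. Whether the real driver can reach a zero delay is exactly what the printk below the excerpt is meant to find out.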

Comment 2 ddamienn 2013-06-09 01:01:28 UTC

(In reply to comment #1)
> If delay is somehow 0, the ->alarm will cause nouveau_fan_update to get
> called immediately. Can you add a printk to that function that shows the
> values? (This may end up totally flooding your dmesg too... but I think the
> values may be the same across prints.)

Thanks for your help. I will try to add the printk. The lock has made the machine inaccessible remotely, so it'll be a couple of days until I can post the results.

Comment 3 ddamienn 2013-06-14 20:24:50 UTC

I've had the debug statement active for the last few days, but the error hasn't recurred. I discovered that my video card's fan was only working intermittently, which a cleaning seems to have fixed. Since the inactive fan likely caused the error I was seeing, I am closing this bug report as invalid.
