Bug 65554

Summary: CPU lock with nouveau_fan_update
Product: xorg Reporter: ddamienn
Component: Driver/nouveau Assignee: Nouveau Project <nouveau>
Status: RESOLVED INVALID QA Contact: Xorg Project Team <xorg-team>
Severity: major    
Priority: medium CC: ddamienn
Version: unspecified   
Hardware: Other   
OS: Linux (All)   
Whiteboard:
Attachments:
Description Flags
kernel log none

Description ddamienn 2013-06-09 00:20:39 UTC
Created attachment 80540: kernel log

Kernel logs report "Watchdog detected hard LOCKUP on cpu 1". It has occurred once per day for the last three days. Each time it happens seemingly at random: the monitor is turned off and I'm connected remotely via SSH, but the system is otherwise idle.

This is on Arch Linux with xf86-video-nouveau 1.0.7. The machine is ~2.5 years old and has been very solid until now. My video card is: 01:00.0 VGA compatible controller: NVIDIA Corporation GT216 [GeForce GT 220] (rev a2).

I have attached the kernel log. I also note that I don't know why the 'PTHERM' events would be occurring - this was while the connected monitor was off and I was only connected remotely via SSH.
Comment 1 Ilia Mirkin 2013-06-09 00:40:47 UTC
nouveau_fan_update:
	spin_lock_irqsave(&fan->lock, flags);
	/* schedule next fan update, if not at target speed already */
	if (list_empty(&fan->alarm.head) && target != duty) {
		u16 bump_period = fan->bios.bump_period;
		u16 slow_down_period = fan->bios.slow_down_period;
...
		ptimer->alarm(ptimer, delay * 1000 * 1000, &fan->alarm);

If delay is somehow 0, the ->alarm will cause nouveau_fan_update to get called immediately. Can you add a printk to that function that shows the values? (This may end up totally flooding your dmesg too... but I think the values may be the same across prints.)

e.g.

diff --git a/drivers/gpu/drm/nouveau/core/subdev/therm/fan.c b/drivers/gpu/drm/nouveau/core/subdev/therm/fan.c
index c728380..9453afd 100644
--- a/drivers/gpu/drm/nouveau/core/subdev/therm/fan.c
+++ b/drivers/gpu/drm/nouveau/core/subdev/therm/fan.c
@@ -88,7 +88,7 @@ nouveau_fan_update(struct nouveau_fan *fan, bool immediate, int target)
                        delay = min(bump_period, slow_down_period) ;
                else
                        delay = bump_period;
-
+               nv_info(therm, "Scheduling fan update in %d (slow: %d, bump: %d)\n", delay, slow_down_period, bump_period);
                ptimer->alarm(ptimer, delay * 1000 * 1000, &fan->alarm);
        }
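
The hypothesis above (a zero delay makes the alarm fire again right away) can be illustrated with a small userspace sketch. This is not the nouveau code: `simulate_updates` and its parameters are hypothetical names, and the sketch assumes that the periods are in milliseconds, that the alarm takes nanoseconds (as the `* 1000 * 1000` scaling suggests), and that the fan never reaches its target so every update re-arms the alarm.

```c
#include <stdint.h>

/*
 * Hypothetical model of a self re-arming fan alarm (not nouveau code).
 * Each pass stands for one run of the update callback, which re-arms
 * the alarm `delay_ms` milliseconds later. Assumption: the fan never
 * reaches its target, so the callback always re-arms.
 *
 * Returns the number of callback runs within `budget_ns` of simulated
 * time, capped at `max_calls` so the zero-delay case terminates here.
 */
unsigned long simulate_updates(uint64_t delay_ms, uint64_t budget_ns,
                               unsigned long max_calls)
{
    uint64_t now_ns = 0;      /* simulated clock */
    unsigned long calls = 0;  /* callback invocations so far */

    while (now_ns < budget_ns && calls < max_calls) {
        calls++;                           /* the update callback runs */
        now_ns += delay_ms * 1000 * 1000;  /* next alarm: ms scaled to ns */
    }
    return calls;
}
```

For example, `simulate_updates(100, 1000000000ULL, 1000000)` returns 10 (one update every 100 ms over one simulated second), while `simulate_updates(0, 1000000000ULL, 1000000)` returns 1000000: with a zero delay the simulated clock never advances, so only the artificial cap ends the loop. In the real driver there is no such cap, which is why a zero delay could plausibly keep a CPU busy.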
Comment 2 ddamienn 2013-06-09 01:01:28 UTC
(In reply to comment #1)
> If delay is somehow 0, the ->alarm will cause nouveau_fan_update to get
> called immediately. Can you add a printk to that function that shows the
> values? (This may end up totally flooding your dmesg too... but I think the
> values may be the same across prints.)

Thanks for your help. I will try to add the printk. The lock has made the machine inaccessible remotely, so it'll be a couple of days until I can post the results.
Comment 3 ddamienn 2013-06-14 20:24:50 UTC
I've had the debug statement active for the last few days, but the error hasn't recurred. I discovered that my video card's fan was only working intermittently, which cleaning it seems to have fixed. Since the inactive fan likely caused the error I was seeing, I am closing this bug report as invalid.
