Summary: | fan alarm timer/update called from interrupt in an infinite loop | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | xorg | Reporter: | Olivier Diotte <vhann3000+freedesktop> | ||||||
Component: | Driver/nouveau | Assignee: | Nouveau Project <nouveau> | ||||||
Status: | RESOLVED MOVED | QA Contact: | Xorg Project Team <xorg-team> | ||||||
Severity: | normal | ||||||||
Priority: | medium | CC: | imirkin, jim | ||||||
Version: | unspecified | ||||||||
Hardware: | Other | ||||||||
OS: | All | ||||||||
Whiteboard: | |||||||||
i915 platform: | i915 features: | ||||||||
Attachments: |
|
Description
Olivier Diotte
2015-07-21 14:52:15 UTC
This happens on 4.1.0 (where I saw the previous dump) and v4.0.0 (where I have nothing in the logs). I did not use Ctrl+Alt+Delete nor the magic keys on v4.0.0, so maybe those key combinations are what generate the crash messages?

Looks similar to https://bugs.freedesktop.org/show_bug.cgi?id=91355 to my outsider's eye.

This doesn't happen on 3.13.0.

I have the exact same system (slightly older BIOS version) and I'm seeing the same lockup, which occurred while idling over the weekend. Seems to be stemming from nvkm_fantog_update. This is 4.2-rc6. I also saw lockups on 4.1.0 which I'm guessing were the same cause, but I wasn't able to get any logs from those. 3.13 was working fine.

[176041.174195] ------------[ cut here ]------------
[176041.174202] WARNING: CPU: 0 PID: 0 at /build/linux-Tql0Dq/linux-4.2~rc6/kernel/watchdog.c:311 watchdog_overflow_callback+0x84/0xb0()
[176041.174203] Watchdog detected hard LOCKUP on cpu 0
[176041.174203] Modules linked in: ipt_MASQUERADE nf_nat_masquerade_ipv4 xt_conntrack ipt_REJECT nf_reject_ipv4 xt_tcpudp iptable_filter nf_nat_h323 nf_conntrack_h323 nf_nat_pptp nf_nat_proto_gre nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_tftp nf_conntrack_tftp nf_nat_sip nf_conntrack_sip nf_nat_irc nf_conntrack_irc nf_nat_ftp nf_conntrack_ftp iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack ip_tables x_tables loop sdhci_pci sdhci mmc_core ctr ccm binfmt_misc nfsd auth_rpcgss oid_registry nfs_acl nfs lockd grace fscache sunrpc joydev hid_generic usbhid hid uas usb_storage ath3k btusb btrtl btbcm btintel bluetooth x86_pkg_temp_thermal intel_powerclamp intel_rapl iosf_mbi coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel sha256_ssse3 sha256_generic hmac drbg ansi_cprng snd_hda_codec_hdmi snd_hda_codec_realtek aesni_intel nouveau mxm_wmi snd_hda_codec_generic arc4 wmi ttm drm_kms_helper drm aes_x86_64 lrw snd_hda_intel gf128mul snd_hda_codec glue_helper snd_hda_core ablk_helper snd_hwdep ath9k
ath9k_common ath9k_hw cryptd iTCO_wdt iTCO_vendor_support snd_pcm xhci_pci evdev dcdbas psmouse serio_raw pcspkr i2c_algo_bit ath mac80211 cfg80211 snd_timer xhci_hcd snd rfkill soundcore video battery i2c_i801 lpc_ich mfd_core mei_me mei shpchp processor button efi_pstore efivars fuse parport_pc ppdev lp parport autofs4 ext4 crc16 mbcache jbd2 sg sr_mod cdrom sd_mod r8169 mii ahci crc32c_intel libahci libata ehci_pci ehci_hcd scsi_mod usbcore usb_common fan thermal thermal_sys [176041.174257] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.2.0-rc6-amd64 #1 Debian 4.2~rc6-1~exp1 [176041.174258] Hardware name: Dell Inc. XPS 8700/0KWVT8, BIOS A08 04/16/2014 [176041.174259] 0000000000000000 ffffffff81729448 ffffffff81546a1d ffff88021ec05b60 [176041.174260] ffffffff8106e571 ffff880214e09000 0000000000000000 ffff88021ec05c40 [176041.174262] ffff88021ec05ef8 0000000000000000 ffffffff8106e5ea ffffffff81729420 [176041.174263] Call Trace: [176041.174264] <NMI> [<ffffffff81546a1d>] ? dump_stack+0x40/0x50 [176041.174270] [<ffffffff8106e571>] ? warn_slowpath_common+0x81/0xb0 [176041.174272] [<ffffffff8106e5ea>] ? warn_slowpath_fmt+0x4a/0x50 [176041.174273] [<ffffffff811080c4>] ? watchdog_overflow_callback+0x84/0xb0 [176041.174276] [<ffffffff811427d2>] ? __perf_event_overflow+0x82/0x1b0 [176041.174278] [<ffffffff81030e24>] ? intel_pmu_handle_irq+0x1d4/0x440 [176041.174281] [<ffffffff81027fa5>] ? perf_event_nmi_handler+0x25/0x40 [176041.174283] [<ffffffff8101c804>] ? native_sched_clock+0x24/0x80 [176041.174284] [<ffffffff810176d2>] ? nmi_handle+0x82/0x110 [176041.174285] [<ffffffff81017c42>] ? default_do_nmi+0xc2/0x110 [176041.174286] [<ffffffff81017d6b>] ? do_nmi+0xdb/0x130 [176041.174289] [<ffffffff8154e414>] ? end_repeat_nmi+0x1a/0x1e [176041.174290] [<ffffffff810aedf5>] ? native_queued_spin_lock_slowpath+0x165/0x180 [176041.174292] [<ffffffff810aedf5>] ? native_queued_spin_lock_slowpath+0x165/0x180 [176041.174293] [<ffffffff810aedf5>] ? 
native_queued_spin_lock_slowpath+0x165/0x180
[176041.174293] <<EOE>> <IRQ> [<ffffffff8154bf12>] ? _raw_spin_lock_irqsave+0x32/0x40
[176041.174306] [<ffffffffa060742c>] ? nvkm_fantog_update+0x4c/0x110 [nouveau]
[176041.174313] [<ffffffffa0607542>] ? nvkm_fantog_set+0x32/0x40 [nouveau]
[176041.174320] [<ffffffffa0606a89>] ? nvkm_fan_update+0xe9/0x1e0 [nouveau]
[176041.174327] [<ffffffffa060992c>] ? nv04_timer_alarm_trigger+0x10c/0x150 [nouveau]
[176041.174333] [<ffffffffa06074ee>] ? nvkm_fantog_update+0x10e/0x110 [nouveau]
[176041.174339] [<ffffffffa060992c>] ? nv04_timer_alarm_trigger+0x10c/0x150 [nouveau]
[176041.174344] [<ffffffffa0609a8a>] ? nv04_timer_intr+0x6a/0x90 [nouveau]
[176041.174351] [<ffffffffa0600e2d>] ? nvkm_mc_intr+0xed/0x160 [nouveau]
[176041.174353] [<ffffffff810bd7eb>] ? handle_irq_event_percpu+0x6b/0x180
[176041.174354] [<ffffffff810bd940>] ? handle_irq_event+0x40/0x60
[176041.174356] [<ffffffff810c092b>] ? handle_edge_irq+0x7b/0x140
[176041.174357] [<ffffffff81015eed>] ? handle_irq+0x1d/0x30
[176041.174359] [<ffffffff8154ecf6>] ? do_IRQ+0x46/0xd0
[176041.174361] [<ffffffff8154ccab>] ? common_interrupt+0x6b/0x6b
[176041.174361] <EOI> [<ffffffff81424278>] ? cpuidle_enter_state+0xe8/0x220
[176041.174365] [<ffffffff81424253>] ? cpuidle_enter_state+0xc3/0x220
[176041.174367] [<ffffffff810aa0f6>] ? cpu_startup_entry+0x256/0x310
[176041.174368] [<ffffffff8192af5e>] ? start_kernel+0x480/0x48b
[176041.174370] [<ffffffff8192a120>] ? early_idt_handler_array+0x120/0x120
[176041.174371] [<ffffffff8192a120>] ? early_idt_handler_array+0x120/0x120
[176041.174373] [<ffffffff8192a605>] ? x86_64_start_kernel+0x148/0x157
[176041.174374] ---[ end trace 0ed75ece92f6e55f ]---

Comment 8 here: https://bugzilla.redhat.com/show_bug.cgi?id=1183087#c8 has another backtrace on 4.0.4 that appears to be the same.
This post from 임성택 also describes the problem and provides a workaround patch: http://lists.freedesktop.org/archives/nouveau/2015-March/020421.html

On kernel 4.2.6, I got a similar crash. The CPU locked up. The machine was doing some heavy computation on many processors. After removing the nouveau module, the problem went away.

Dec 06 18:51:50 bullseye kernel: WARNING: CPU: 0 PID: 19974 at kernel/watchdog.c:338 watchdog_overflow_callback+0x79/0xa0()
Dec 06 18:51:50 bullseye kernel: Watchdog detected hard LOCKUP on cpu 0
Dec 06 18:51:50 bullseye kernel: Modules linked in:
Dec 06 18:51:50 bullseye kernel: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4
Dec 06 18:51:50 bullseye kernel: crc32_pclmul crc32c_intel drm serio_raw uas usb_storage hpsa ata_generic pata_acpi wmi
Dec 06 18:51:50 bullseye kernel: CPU: 0 PID: 19974 Comm: redacted Not tainted 4.2.6-201.fc22.x86_64 #1
Dec 06 18:51:50 bullseye kernel: Hardware name: HP ProLiant DL360p Gen8, BIOS P71 07/01/2015
Dec 06 18:51:50 bullseye kernel: 0000000000000000 000000009a975639 ffff880fff605aa0 ffffffff817729ea
Dec 06 18:51:50 bullseye kernel: 0000000000000000 ffff880fff605af8 ffff880fff605ae0 ffffffff8109e4b6
Dec 06 18:51:50 bullseye kernel: 0000000000000000 ffff881ff92ae000 0000000000000000 ffff880fff605c00
Dec 06 18:51:50 bullseye kernel: Call Trace:
Dec 06 18:51:50 bullseye kernel: <NMI> [<ffffffff817729ea>] dump_stack+0x45/0x57
Dec 06 18:51:50 bullseye kernel: [<ffffffff8109e4b6>] warn_slowpath_common+0x86/0xc0
Dec 06 18:51:50 bullseye kernel: [<ffffffff8109e545>] warn_slowpath_fmt+0x55/0x70
Dec 06 18:51:50 bullseye kernel: [<ffffffff81153029>] watchdog_overflow_callback+0x79/0xa0
Dec 06 18:51:50 bullseye kernel: [<ffffffff81197f50>] __perf_event_overflow+0x90/0x1c0
Dec 06 18:51:50
bullseye kernel: [<ffffffff81198b54>] perf_event_overflow+0x14/0x20 Dec 06 18:51:50 bullseye kernel: [<ffffffff81033af7>] intel_pmu_handle_irq+0x1e7/0x470 Dec 06 18:51:50 bullseye kernel: [<ffffffff8102a3e6>] perf_event_nmi_handler+0x26/0x40 Dec 06 18:51:50 bullseye kernel: [<ffffffff810188b3>] nmi_handle+0x83/0x120 Dec 06 18:51:50 bullseye kernel: [<ffffffff81018df2>] default_do_nmi+0x42/0xf0 Dec 06 18:51:50 bullseye kernel: [<ffffffff81018f8a>] do_nmi+0xea/0x140 Dec 06 18:51:50 bullseye kernel: [<ffffffff8177b701>] end_repeat_nmi+0x1a/0x1e Dec 06 18:51:50 bullseye kernel: [<ffffffff810e677c>] ? queued_spin_lock_slowpath+0x15c/0x170 Dec 06 18:51:50 bullseye kernel: [<ffffffff810e677c>] ? queued_spin_lock_slowpath+0x15c/0x170 Dec 06 18:51:50 bullseye kernel: [<ffffffff810e677c>] ? queued_spin_lock_slowpath+0x15c/0x170 Dec 06 18:51:50 bullseye kernel: <<EOE>> <IRQ> [<ffffffff817791df>] _raw_spin_lock_irqsave+0x3f/0x50 Dec 06 18:51:50 bullseye kernel: [<ffffffffa0218cbe>] nvkm_fantog_update+0x4e/0x120 [nouveau] Dec 06 18:51:50 bullseye kernel: [<ffffffffa0218de5>] nvkm_fantog_set+0x35/0x40 [nouveau] Dec 06 18:51:50 bullseye kernel: [<ffffffffa02182cc>] nvkm_fan_update+0xec/0x1e0 [nouveau] Dec 06 18:51:50 bullseye kernel: [<ffffffffa02183f9>] nvkm_therm_fan_set+0x19/0x20 [nouveau] Dec 06 18:51:50 bullseye kernel: [<ffffffffa0217bfc>] nvkm_therm_update+0x11c/0x2d0 [nouveau] Dec 06 18:51:50 bullseye kernel: [<ffffffffa0217dca>] nvkm_therm_alarm+0x1a/0x20 [nouveau] Dec 06 18:51:50 bullseye kernel: [<ffffffffa021b472>] nv04_timer_alarm_trigger+0x122/0x170 [nouveau] Dec 06 18:51:50 bullseye kernel: [<ffffffffa021b521>] nv04_timer_alarm+0x61/0xc0 [nouveau] Dec 06 18:51:50 bullseye kernel: [<ffffffffa0218d82>] nvkm_fantog_update+0x112/0x120 [nouveau] Dec 06 18:51:50 bullseye kernel: [<ffffffffa0218daa>] nvkm_fantog_alarm+0x1a/0x20 [nouveau] Dec 06 18:51:50 bullseye kernel: [<ffffffffa021b472>] nv04_timer_alarm_trigger+0x122/0x170 [nouveau] Dec 06 18:51:50 bullseye kernel: 
[<ffffffffa021b5e2>] nv04_timer_intr+0x62/0x80 [nouveau] Dec 06 18:51:50 bullseye kernel: [<ffffffffa021211b>] nvkm_mc_intr+0xfb/0x170 [nouveau] Dec 06 18:51:50 bullseye kernel: [<ffffffff810f5ff4>] handle_irq_event_percpu+0x74/0x180 Dec 06 18:51:50 bullseye kernel: [<ffffffff810f6130>] handle_irq_event+0x30/0x60 Dec 06 18:51:50 bullseye kernel: [<ffffffff810f944f>] handle_edge_irq+0x6f/0x130 Dec 06 18:51:50 bullseye kernel: [<ffffffff81016e62>] handle_irq+0x72/0x120 Dec 06 18:51:50 bullseye kernel: [<ffffffff8177c01f>] do_IRQ+0x4f/0xe0 Dec 06 18:51:50 bullseye kernel: [<ffffffff81779f2b>] common_interrupt+0x6b/0x6b Dec 06 18:51:50 bullseye kernel: <EOI> Dec 06 18:51:50 bullseye kernel: ---[ end trace c72347df4d25d0c7 ]--- Dec 06 18:51:50 bullseye kernel: INFO: rcu_sched detected stalls on CPUs/tasks: { 0} (detected by 17, t=60002 jiffies, g=184581, c=184580, q=0) Dec 06 18:51:50 bullseye kernel: Task dump for CPU 0: Dec 06 18:51:50 bullseye kernel: redacted R running task 0 19974 3145 0x00000008 Dec 06 18:51:50 bullseye kernel: ffff8819589e7b98 ffff881ff7e69800 ffffffffffffff74 0000000000000028 Dec 06 18:51:50 bullseye kernel: 0000000000000000 0000000000000020 ffff8819589e7b98 ffffffff8164869c Dec 06 18:51:50 bullseye kernel: 0000000000000000 ffff881ff7e69800 ffff8819589e7bc8 ffffffff816c1c80 Dec 06 18:51:50 bullseye kernel: Call Trace: Dec 06 18:51:50 bullseye kernel: [<ffffffff8164869c>] ? sk_reset_timer+0x1c/0x30 Dec 06 18:51:50 bullseye kernel: [<ffffffff816c1c80>] tcp_send_delayed_ack+0x100/0x130 Dec 06 18:51:50 bullseye kernel: [<ffffffff816b36cd>] __tcp_ack_snd_check+0x6d/0x90 Dec 06 18:51:50 bullseye kernel: [<ffffffff816bb4fc>] tcp_rcv_established+0x4cc/0x780 Dec 06 18:51:50 bullseye kernel: [<ffffffff813aeac9>] ? copy_to_iter+0x79/0x260 Dec 06 18:51:50 bullseye kernel: [<ffffffff81778f2a>] ? _raw_write_unlock_bh+0x1a/0x20 Dec 06 18:51:50 bullseye kernel: [<ffffffff81778f3e>] ? 
_raw_spin_unlock_bh+0xe/0x10
Dec 06 18:51:50 bullseye kernel: [<ffffffff81648f66>] ? release_sock+0x106/0x150
Dec 06 18:51:50 bullseye kernel: [<ffffffff810d28af>] ? numa_migrate_preferred+0x2f/0x90
Dec 06 18:51:50 bullseye kernel: [<ffffffff810d686d>] ? task_numa_fault+0x7bd/0xae0
Dec 06 18:51:50 bullseye kernel: [<ffffffff811d5f81>] ? handle_mm_fault+0xb81/0x17d0
Dec 06 18:51:50 bullseye kernel: [<ffffffff816447cb>] ? sock_recvmsg+0x3b/0x50
Dec 06 18:51:50 bullseye kernel: [<ffffffff81644a16>] ? SYSC_recvfrom+0xd6/0x150
Dec 06 18:51:50 bullseye kernel: [<ffffffff81065454>] ? __do_page_fault+0x1b4/0x400
Dec 06 18:51:50 bullseye kernel: [<ffffffff81774d11>] ? __schedule+0x371/0x950
Dec 06 18:51:50 bullseye kernel: [<ffffffff810656cf>] ? do_page_fault+0x2f/0x80
Dec 06 18:51:50 bullseye kernel: [<ffffffff8177b378>] ? page_fault+0x28/0x30

Is this actually an old bug or a resurrected bug? Someone had a similar backtrace dating back to 2013: https://bbs.archlinux.org/viewtopic.php?id=164714

And now I'm getting the same issue after adding a GK208B and a (fanless) NV34 to my system. khugepaged gets stuck like so:

[83695.847012] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [khugepaged:59]
[83695.847015] Modules linked in: rtl8xxxu mac80211 cfg80211 it87 hwmon_vid nouveau uas usb_storage fbcon video bitblit softcursor font i2c_algo_bit ttm drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea drm backlight fb fbdev mxm_wmi wmi
[83695.847032] CPU: 0 PID: 59 Comm: khugepaged Tainted: G I L 4.6.0+ #2
[83695.847033] Hardware name: Gigabyte Technology Co., Ltd.
EX58-UD3R/EX58-UD3R, BIOS FB 05/04/2009 [83695.847035] task: ffff8801d822b000 ti: ffff8801d8314000 task.ti: ffff8801d8314000 [83695.847036] RIP: 0010:[<ffffffff810e9fef>] [<ffffffff810e9fef>] smp_call_function_many+0x1de/0x1f1 [83695.847042] RSP: 0018:ffff8801d8317be0 EFLAGS: 00000202 [83695.847044] RAX: 0000000000000005 RBX: ffff8801dfc169c8 RCX: 0000000000000005 [83695.847045] RDX: ffff8801dfd59328 RSI: 0000000000000008 RDI: ffff8801dfc169c8 [83695.847046] RBP: ffff8801d8317c20 R08: 0000000000000005 R09: 0000000000000000 [83695.847047] R10: 000000000000175d R11: 0000000000000009 R12: ffff8801dfc169c0 [83695.847049] R13: 0000000000016980 R14: 0000000000000001 R15: 0000000000000008 [83695.847050] FS: 0000000000000000(0000) GS:ffff8801dfc00000(0000) knlGS:0000000000000000 [83695.847051] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [83695.847053] CR2: 00007f2c7ed83000 CR3: 0000000001c07000 CR4: 00000000000006f0 [83695.847054] Stack: [83695.847055] 0100000000000000 0000000000000000 ffffffff8114d0c8 0000000000000000 [83695.847057] ffffffff8114d0c8 0000000000000000 ffffffff81f31bc8 0000000000000009 [83695.847059] ffff8801d8317c50 ffffffff810ea0a5 0000000000000008 0000000000000000 [83695.847061] Call Trace: [83695.847065] [<ffffffff8114d0c8>] ? page_alloc_cpu_notify+0x41/0x41 [83695.847067] [<ffffffff8114d0c8>] ? page_alloc_cpu_notify+0x41/0x41 [83695.847068] [<ffffffff810ea0a5>] on_each_cpu_mask+0x28/0x48 [83695.847070] [<ffffffff8114d5f2>] drain_all_pages+0x94/0xbb [83695.847073] [<ffffffff8114fa07>] __alloc_pages_nodemask+0x5f3/0x8b0 [83695.847075] [<ffffffff81173b64>] ? __page_set_anon_rmap+0x31/0x7d [83695.847079] [<ffffffff8118a919>] __alloc_pages_node.isra.57+0x12/0x14 [83695.847081] [<ffffffff8118ae09>] khugepaged+0xcc/0x10d1 [83695.847084] [<ffffffff810bdebe>] ? finish_wait+0x62/0x62 [83695.847086] [<ffffffff8118ad3d>] ? 
maybe_pmd_mkwrite+0x1a/0x1a [83695.847089] [<ffffffff810a8f93>] kthread+0xa5/0xad [83695.847093] [<ffffffff81766492>] ret_from_fork+0x22/0x40 [83695.847094] [<ffffffff810a8eee>] ? init_completion+0x24/0x24 [83695.847095] Code: 74 2d 48 89 de 89 c7 e8 f8 f9 ff ff 3b 05 4e c1 c3 00 7d 1b 48 63 c8 49 8b 14 24 48 03 14 cd c0 4a d2 81 f6 42 18 01 74 04 f3 90 <eb> f6 eb d3 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 5f 5d c3 66 66 But looking at all the active CPUs, there's always one with [83696.530737] NMI backtrace for cpu 5 [83696.530737] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G I L 4.6.0+ #2 [83696.530739] Hardware name: Gigabyte Technology Co., Ltd. EX58-UD3R/EX58-UD3R, BIOS FB 05/04/2009 [83696.530740] task: ffff8801d8a3b000 ti: ffff8801d8a50000 task.ti: ffff8801d8a50000 [83696.530741] RIP: 0010:[<ffffffff810bff24>] [<ffffffff810bff24>] queued_spin_lock_slowpath+0x59/0x173 [83696.530742] RSP: 0018:ffff8801dfd43b98 EFLAGS: 00000002 [83696.530743] RAX: 0000000000000101 RBX: 0000000000000086 RCX: 0000000000000101 [83696.530744] RDX: 0000000000000100 RSI: 0000000000000001 RDI: ffff8801d75ff108 [83696.530745] RBP: ffff8801dfd43b98 R08: 0000000000000001 R09: ffff8801dfd43d53 [83696.530746] R10: ffff8801dfd43d27 R11: 0000000000000000 R12: ffff8801d75ff108 [83696.530747] R13: ffff8801d6ede000 R14: ffff8801d75ff000 R15: ffff8801d528d800 [83696.530748] FS: 0000000000000000(0000) GS:ffff8801dfd40000(0000) knlGS:0000000000000000 [83696.530749] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [83696.530750] CR2: 00000000004330c1 CR3: 0000000001c07000 CR4: 00000000000006e0 [83696.530751] Stack: [83696.530752] ffff8801dfd43bb0 ffffffff817661b4 0000000000000029 ffff8801dfd43c00 [83696.530753] ffffffffc03489b7 0000000000000010 ffff8801d7707400 ffff8801d528d700 [83696.530754] ffff8801d75ff000 0000000000000029 ffff8801d6ede000 0000000000000029 [83696.530755] Call Trace: [83696.530756] <IRQ> d [<ffffffff817661b4>] _raw_spin_lock_irqsave+0x23/0x29 [83696.530757] [<ffffffffc03489b7>] 
nvkm_fantog_update+0x43/0x103 [nouveau] [83696.530758] [<ffffffffc0348ac9>] nvkm_fantog_set+0x38/0x3f [nouveau] [83696.530759] [<ffffffffc034818d>] nvkm_fan_update+0x12c/0x1a7 [nouveau] [83696.530760] [<ffffffffc0348255>] nvkm_therm_fan_set+0x19/0x1b [nouveau] [83696.530761] [<ffffffffc0347c5a>] nvkm_therm_update+0x223/0x230 [nouveau] [83696.530762] [<ffffffffc0347c7c>] nvkm_therm_alarm+0x15/0x17 [nouveau] [83696.530763] [<ffffffffc034a837>] nvkm_timer_alarm_trigger+0xde/0xf6 [nouveau] [83696.530764] [<ffffffffc034a943>] nvkm_timer_alarm+0xaa/0xb3 [nouveau] [83696.530765] [<ffffffffc0348a5c>] nvkm_fantog_update+0xe8/0x103 [nouveau] [83696.530766] [<ffffffffc0348a8f>] nvkm_fantog_alarm+0x18/0x1a [nouveau] [83696.530767] [<ffffffffc034a837>] nvkm_timer_alarm_trigger+0xde/0xf6 [nouveau] [83696.530768] [<ffffffffc034abb3>] nv04_timer_intr+0x39/0x9f [nouveau] [83696.530769] [<ffffffffc034a71f>] nvkm_timer_intr+0x14/0x16 [nouveau] [83696.530770] [<ffffffffc03115d8>] nvkm_subdev_intr+0x17/0x19 [nouveau] [83696.530771] [<ffffffffc033f5fb>] nvkm_mc_intr+0x81/0xd2 [nouveau] [83696.530772] [<ffffffffc0342efa>] nvkm_pci_intr+0x4a/0x5c [nouveau] [83696.530773] [<ffffffff810cbc26>] handle_irq_event_percpu+0x6c/0x196 [83696.530773] [<ffffffff810cbd7b>] handle_irq_event+0x2b/0x4b [83696.530774] [<ffffffff810ce999>] handle_edge_irq+0xa6/0xc3 [83696.530775] [<ffffffff8105841b>] handle_irq+0x109/0x111 [83696.530776] [<ffffffff8176859b>] do_IRQ+0x4b/0xba [83696.530777] [<ffffffff81766bbf>] common_interrupt+0x7f/0x7f [83696.530778] <EOI> d [<ffffffff8157911f>] ? 
cpuidle_enter_state+0x103/0x15b
[83696.530779] [<ffffffff815791a3>] cpuidle_enter+0x17/0x19
[83696.530780] [<ffffffff810be61e>] cpu_startup_entry+0x192/0x1fd
[83696.530781] [<ffffffff8106f2ff>] start_secondary+0xe0/0xe3
[83696.530782] Code: ff ff 75 33 83 fe 01 89 ca 89 f0 41 0f 45 d0 f0 0f b1 17 39 f0 74 04 89 c6 eb e1 ff ca 0f 84 20 01 00 00 8b 07 84 c0 74 04 f3 90 <eb> f6 66 c7 07 01 00 e9 0c 01 00 00 48 c7 c0 40 65 01 00 65 48

Since it's in handle_irq, I assume that that takes some lock which basically prevents the system from proceeding (except for NMIs). I can ssh into the system, and even run some things (like these traces), but khugepaged is at 100%, and some other processes tend to hang. My VBIOS is to follow shortly.

Last I debugged this, I determined that this could happen if some time interval we computed turned out to be 0, causing the timer to be executed from the interrupt.

Created attachment 124024: GK208B VBIOS
I added this card recently. There's also a GT215 which has been in there for a while (https://people.freedesktop.org/~imirkin/traces/nva3/nva3-gddr5.rom) and a fanless NV34 (https://people.freedesktop.org/~imirkin/traces/nv34-pci.rom).

FWIW, I've been running with the workaround patch in comment 6 for about 9 months now and haven't noticed any problems. I hit the warning maybe 4 times a day.

(In reply to Jim Paris from comment #11)
> FWIW, I've been running with the workaround patch in comment 6 for about 9
> months now and haven't noticed any problems. I hit the warning maybe 4
> times a day.

Thanks, patched that in, I now get the warnings too but it survives. Martin, please have a look at this - we shouldn't be allowing the fan update to be called from interrupt context at all in the first place.

(In reply to Ilia Mirkin from comment #12)
> (In reply to Jim Paris from comment #11)
> > FWIW, I've been running with the workaround patch in comment 6 for about 9
> > months now and haven't noticed any problems. I hit the warning maybe 4
> > times a day.
>
> Thanks, patched that in, I now get the warnings too but it survives. Martin,
> please have a look at this - we shouldn't be allowing the fan update to be
> called from interrupt context at all in the first place.

I agree that we should not. The issue here is that core/ is not allowed to depend on work items. At least, it was not supported before. Having a look right now!

Created attachment 128843: Prevent recursion while processing timers
I was having similar problem with daily crashes showing recursive calls to nvkm_timer_alarm_trigger. I've been stable for the last three weeks with the attached patch. It prevents the recursion by only processing the list when the timer expires, and not also processing it each time an item is updated. This patch is based off the Fedora 25 kernel 4.8.14 sources but should apply elsewhere.

Ben took some time to debug what was going on, you can give https://cgit.freedesktop.org/~airlied/linux/log/?h=drm-next a shot. It contains the relevant commits. I haven't tested it myself yet though. (Also, for some reason, this issue has started to trigger less often on newer kernels for me.)

-- GitLab Migration Automatic Message --
This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/205. |
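For anyone landing here later, a rough user-space sketch of what the backtraces above seem to show: the fan toggle path takes its lock, re-arms its alarm, and if arming also runs whatever alarms are already due (for example one whose interval came out as 0), the thermal alarm re-enters the fan path and asks for the same lock. This is not nouveau code. Every name below (`timer_sim`, `schedule`, `trigger`, `fantog_update`, `simulate`) is made up for illustration, the 1000-unit re-arm delay is arbitrary, and the lock is only a flag that counts re-entry instead of spinning forever.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model only; none of these names are nouveau's real API. */
enum { FAN, THERM, NALARMS };

struct timer_sim {
    uint64_t now;               /* simulated current time */
    bool     pending[NALARMS];  /* is the alarm queued? */
    uint64_t due[NALARMS];      /* when the alarm should fire */
    bool     inline_trigger;    /* run due alarms from inside schedule()? */
    bool     fan_lock;          /* stands in for a non-recursive spinlock */
    int      relocks;           /* lock requests made while already held */
};

static void trigger(struct timer_sim *t);

/* Queue an alarm; optionally process the list right away, which is the
 * nesting the backtraces in this report appear to show. */
static void schedule(struct timer_sim *t, int id, uint64_t delay)
{
    assert(id >= 0 && id < NALARMS);
    t->due[id] = t->now + delay;
    t->pending[id] = true;
    if (t->inline_trigger)
        trigger(t);
}

/* Fan toggle path: takes the lock and re-arms its alarm while holding it. */
static void fantog_update(struct timer_sim *t)
{
    if (t->fan_lock) {          /* a real spinlock would spin here forever */
        t->relocks++;
        return;
    }
    t->fan_lock = true;
    if (!t->pending[FAN])
        schedule(t, FAN, 1000); /* arbitrary non-zero delay, lock still held */
    t->fan_lock = false;
}

/* Run every alarm that is due; in this model both callbacks reach the fan path. */
static void trigger(struct timer_sim *t)
{
    for (int id = 0; id < NALARMS; id++) {
        if (t->pending[id] && t->due[id] <= t->now) {
            t->pending[id] = false;
            fantog_update(t);
        }
    }
}

/* One simulated timer interrupt; returns how many times the held lock
 * was requested again (each one would be a hard lockup in the kernel). */
int simulate(bool inline_trigger, uint64_t therm_delay)
{
    struct timer_sim t = {0};
    schedule(&t, FAN, 0);             /* fan toggle alarm already due */
    schedule(&t, THERM, therm_delay); /* thermal alarm, possibly due too */
    t.inline_trigger = inline_trigger;
    trigger(&t);                      /* the timer interrupt */
    return t.relocks;
}
```

In this model `simulate(true, 0)` reports one re-entrant lock request, while `simulate(false, 0)` (list processed only when the timer fires, which is how the patch in attachment 128843 is described) and `simulate(true, 5000)` (thermal alarm not yet due) report none.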