Loading nouveau drm module causes big kernel problems. Trying to load X always results in hardlock. If I blacklist nouveau drm module with the same kernel, it is perfectly stable. * nouveau git kernel from 09/09 b4eaf6c7cf5621a7ccc556123ccab64811ab2e40 * ddx git from 09/09 * X.Org X Server 1.6.3.901 (1.6.4 RC 1) * libdrm 2.4.13 * 01:00.0 VGA compatible controller: nVidia Corporation G84 [GeForce 8600M GT] (rev a1) [ 281.450035] nouveau 0000:01:00.0: nouveau_channel_free: freeing fifo 2 [ 281.463089] nouveau 0000:01:00.0: nouveau_channel_free: freeing fifo 1 [ 281.486950] [TTM] Zone kernel: Used memory at exit: 0 kiB. [ 281.486956] [TTM] Zone highmem: Used memory at exit: 0 kiB. [ 288.525197] ------------[ cut here ]------------ [ 288.525215] WARNING: at block/cfq-iosched.c:1771 cfq_cic_lookup+0x61/0xce() [ 288.525222] Hardware name: XPS M1530 [ 288.525226] Modules linked in: fuse nouveau ttm drm_kms_helper drm agpgart i2c_algo_bit iwl3945 iwlcore mac80211 cfg80211 ohci1394 ieee1394 btusb bluetooth rfkill sdhci_pci sdhci mmc_core ricoh_mmc sky2 cpufreq_conservative acpi_cpufreq freq_table loop [ 288.525279] Pid: 2982, comm: hald-addon-stor Not tainted 2.6.31-rc8 #1 [ 288.525284] Call Trace: [ 288.525296] [<c0225a00>] warn_slowpath_common+0x65/0x7c [ 288.525305] [<c03ada74>] ? cfq_cic_lookup+0x61/0xce [ 288.525313] [<c0225a24>] warn_slowpath_null+0xd/0x10 [ 288.525321] [<c03ada74>] cfq_cic_lookup+0x61/0xce [ 288.525330] [<c03adc27>] cfq_may_queue+0x1e/0xab [ 288.525338] [<c039d892>] elv_may_queue+0x14/0x1b [ 288.525348] [<c03a291f>] get_request+0x24/0x253 [ 288.525356] [<c03a300e>] get_request_wait+0x18/0x12f [ 288.525366] [<c0239077>] ? hrtimer_wakeup+0x0/0x1c [ 288.525375] [<c0236d61>] ? remove_wait_queue+0x31/0x36 [ 288.525383] [<c03a315c>] blk_get_request+0x37/0x58 [ 288.525394] [<c0428cd1>] scsi_execute+0x23/0x11b [ 288.525403] [<c0428e1b>] scsi_execute_req+0x52/0x7f [ 288.525412] [<c0433db0>] sr_test_unit_ready+0x41/0x99 [ 288.525420] [<c04347c8>] sr_media_change+0x37/0x1e3 [ 288.525430] [<c044d459>] media_changed+0x43/0x72 [ 288.525438] [<c044d4b0>] cdrom_media_changed+0x28/0x2e [ 288.525446] [<c0433e4f>] sr_block_media_changed+0x11/0x13 [ 288.525456] [<c02b1e7d>] check_disk_change+0x19/0x42 [ 288.525464] [<c044f23f>] cdrom_open+0x794/0x7f7 [ 288.525473] [<c0423ba6>] ? scsi_done+0x0/0xd [ 288.525480] [<c043f4d1>] ? atapi_xlat+0x0/0x15f [ 288.525491] [<c05bfe39>] ? schedule_timeout+0x17/0xbd [ 288.525499] [<c0423ba6>] ? scsi_done+0x0/0xd [ 288.525508] [<c03b0907>] ? kobject_put+0x37/0x3c [ 288.525518] [<c041bf8a>] ? put_device+0xf/0x11 [ 288.525526] [<c04280a2>] ? scsi_request_fn+0x326/0x3c6 [ 288.525536] [<c0380536>] ? avc_has_perm+0x3c/0x46 [ 288.525545] [<c0380536>] ? avc_has_perm+0x3c/0x46 [ 288.525553] [<c03b09e8>] ? kobject_get+0x12/0x17 [ 288.525562] [<c03a88cf>] ? get_disk+0x39/0x50 [ 288.525571] [<c03a88f0>] ? exact_lock+0xa/0x11 [ 288.525579] [<c03b09e8>] ? kobject_get+0x12/0x17 [ 288.525587] [<c041c01a>] ? get_device+0x13/0x18 [ 288.525595] [<c0433f8a>] sr_block_open+0x6d/0x82 [ 288.525603] [<c02b27a0>] __blkdev_get+0x7f/0x289 [ 288.525612] [<c02b29b4>] blkdev_get+0xa/0xc [ 288.525619] [<c02b2a15>] blkdev_open+0x5f/0x8b [ 288.525629] [<c0292e84>] __dentry_open+0x111/0x1f4 [ 288.525638] [<c0293005>] nameidata_to_filp+0x2d/0x42 [ 288.525646] [<c02b29b6>] ? blkdev_open+0x0/0x8b [ 288.525655] [<c029cb6a>] do_filp_open+0x3b2/0x6a4 [ 288.525663] [<c044e0dd>] ? cdrom_release+0x16c/0x19f [ 288.525671] [<c03b0907>] ? kobject_put+0x37/0x3c [ 288.525680] [<c029069b>] ? __slab_alloc+0x9a/0x425 [ 288.525689] [<c0290af8>] ? kmem_cache_alloc+0x57/0xd9 [ 288.525697] [<c029cebb>] ? getname+0x1b/0xb9 [ 288.525706] [<c02a4787>] ? alloc_fd+0x53/0xb8 [ 288.525715] [<c0292c7d>] do_sys_open+0x48/0xdf [ 288.525724] [<c0292d56>] sys_open+0x1e/0x26 [ 288.525733] [<c0202984>] sysenter_do_call+0x12/0x22 [ 288.525739] ---[ end trace 1946ff4265fe09ba ]--- [ 288.525759] ------------[ cut here ]------------ [ 288.526043] kernel BUG at block/cfq-iosched.c:1775! [ 288.526104] invalid opcode: 0000 [#1] SMP
Created attachment 29389 [details] xorg log
Created attachment 29390 [details] kernel dmesg
Created attachment 29391 [details] kernel config
After a git bisect and 10 reboot, I found the offending commit : http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=9d8ae0c84726807c43b7230b8e6c2d079c33b81d reverting that line works perfectly here :) I just regret I first tried nouveau-git right after that commit went in. If I had tried just before, it would have been easy to find the regression.
sorry, disregard the last information. The problem is that there is some randomness involved. The kernel booted fine twice in a row, and then everything worked fine. But rebooting another time triggered the same failure again.
Same here. Same distribution. Different card. I have a gf6200TC. It happens to me only when I enable KMS. Then it took me 8 boots to get it up. 7times it crashed similar. With KMS disabled this never happened and Xorg starts fine. Then there are problems shutting down Xorg process but that's probably something different. How can we help to track this down? stable kernel 2.6.31, nouveau from master daily snapshot 20090909. gcc4.4.1
From what has been reported here, this looks like a race in a completely irrelevant (to nouveau) part of the kernel. It happens randomly, more often when nouveau.ko KMS is loaded early, and less often when nouveau.ko loads later. AFAIK no-one has yet managed to hit it completely without Nouveau, which is somewhat concerning. My guess is that nouveau alters the kernel code execution timings and makes a race elsewhere a lot more failure prone. Some suggestions: - search kernel mailing lists for CFQ bug reports (since it seems to happen in CFQ code) - try to reproduce it without nouveau, e.g., try an old FB driver - try enabling heavy debugging features in the kernel config, like kmemcheck or kmemleak I'm trying to steer this away from Nouveau first, because it looks very difficult to debug. In the future, please provide full kernel logs from boot, and set attachment type to text/plain.
Some new information : * CONFIG_DEBUG_SPINLOCK (spinlock and rw lock debugging : basic checks) works around the problem (it leads to very stable nouveau kms) * maxcpus=1 does not workaround the problem * traces are not always in cfq code * traces only happen when nouveau is loaded * other framebuffer drivers work fine (vesafb, uvesafb, nvidiafb) * crash are (almost?) always with kms on * crash can happen both at boot or when reloading module * crash seem more frequent when nouveau is loaded by udev or in early boot, rather than loaded with X * happen both with 2.6.31-rc8 (nouveau-git) and 2.6.31 stable
Created attachment 29577 [details] crash when reloading the module (2.6.31 stable / master-compat 20090909)
Created attachment 29578 [details] crash when reloading the module (nouveau-git kernel from today - 2.6.31-rc8)
* building a kernel without highmem does not help * booting with nosmp does not help
Disabling PAT on latest nouveau git kernel does not help. 2.6.31.1 kernel (+ master-compat snapshopt from today) does not help.
So still using nouveau kernel (2.6.31-rc9 now). When nouveau is loaded during boot, it seems to crash 100% of the times, with a trace similar to the one here : https://bugs.freedesktop.org/attachment.cgi?id=29578 That is : BUG: unable to handle kernel NULL pointer dereference at 00000003 IP: [<c027e452>] kmem_cache_alloc+0x62/0xda When I boot with noacpi, the first initialization does not crash for some reasons... But as soon as I reload the module, it crashes, in random places of the code. I will attach two dmesg showing that.
Created attachment 29924 [details] crash when reloading the module (nouveau-git 2.6.31-rc9 booted with noacpi)
Created attachment 29925 [details] another crash when reloading the module (nouveau-git 2.6.31-rc9 booted with noacpi)
Created attachment 29949 [details] [review] replacement for DEBUG_SPINLOCK? Could you try if this patch has the same effect as enabling DEBUG_SPINLOCK? If it does, could you try to reduce the patch to a minimal set of changes that does the same.
Created attachment 29950 [details] ttm trace log
I have a x86_64 system and enabled DEBUG_SPINLOCK but it didn't solve anything for me. I enabled the debug option for the drm module and when udev starts it shows Sep 29 16:59:20 workstation64 kernel: BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc Sep 29 16:59:20 workstation64 kernel: IP: [<ffffffff811f726b>] _raw_spin_lock+0x2b/0x190 Sep 29 16:59:20 workstation64 kernel: PGD 22e59f067 PUD 22e60a067 PMD 0 Sep 29 16:59:20 workstation64 kernel: Oops: 0000 [#1] PREEMPT SMP Sep 29 16:59:20 workstation64 kernel: last sysfs file: /sys/module/evdev/initstate Sep 29 16:59:20 workstation64 kernel: CPU 3 Sep 29 16:59:20 workstation64 kernel: Modules linked in: ttm(+) serio_raw drm_kms_helper thermal sg mii pcspkr soundcore snd_page_alloc evdev ehci_hcd(+) drm i2c_i801 button iTCO_wdt i2c_algo_bit iTCO_vendor_su pport i2c_core processor usbcore intel_agp rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 sd_mod ahci libata scsi_mod Sep 29 16:59:20 workstation64 kernel: Pid: 601, comm: modprobe Not tainted 2.6.31-ARCH #1 OEM Sep 29 16:59:20 workstation64 kernel: RIP: 0010:[<ffffffff811f726b>] [<ffffffff811f726b>] _raw_spin_lock+0x2b/0x190 more in the log
Created attachment 29962 [details] [review] [PATCH] drm/nouveau: fix missized allocation for ttm_bo_global struct This patch from http://0x04.net/~mwk/0001-drm-nouveau-fix-missized-allocation-for-ttm_bo_glob.patch fixes it, right?
Yes, it fixes it for Xavier (i686) and me (x86_64).
Seems to work for everyone involved, fix has been commited to git. Closing.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.