Bug 23847

Summary:

kernel BUG when using nouveau

Product:

xorg

Reporter:

Xavier <shiningxc>

Component:

Driver/nouveau

Assignee:

Nouveau Project <nouveau>

Status:

RESOLVED FIXED

QA Contact:

Xorg Project Team <xorg-team>

Severity:

normal

Priority:

medium

CC:

andyrtr

Version:

7.4 (2008.09)

Hardware:

Other

OS:

All

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
xorg log	none
kernel dmesg	none
kernel config	none
crash when reloading the module (2.6.31 stable / master-compat 20090909)	none
crash when reloading the module (nouveau-git kernel from today - 2.6.31-rc8)	none
crash when reloading the module (nouveau-git 2.6.31-rc9 booted with noacpi)	none
another crash when reloading the module (nouveau-git 2.6.31-rc9 booted with noacpi)	none
replacement for DEBUG_SPINLOCK?	none
ttm trace log	none
[PATCH] drm/nouveau: fix missized allocation for ttm_bo_global struct	none

Description Xavier 2009-09-10 10:48:44 UTC

Loading nouveau drm module causes big kernel problems. Trying to load X always results in hardlock.
If I blacklist nouveau drm module with the same kernel, it is perfectly stable.

* nouveau git kernel from 09/09 b4eaf6c7cf5621a7ccc556123ccab64811ab2e40
* ddx git from 09/09
* X.Org X Server 1.6.3.901 (1.6.4 RC 1)
* libdrm 2.4.13
* 01:00.0 VGA compatible controller: nVidia Corporation G84 [GeForce 8600M GT] (rev a1)

[  281.450035] nouveau 0000:01:00.0: nouveau_channel_free: freeing fifo 2
[  281.463089] nouveau 0000:01:00.0: nouveau_channel_free: freeing fifo 1
[  281.486950] [TTM] Zone  kernel: Used memory at exit: 0 kiB.
[  281.486956] [TTM] Zone highmem: Used memory at exit: 0 kiB.
[  288.525197] ------------[ cut here ]------------
[  288.525215] WARNING: at block/cfq-iosched.c:1771 cfq_cic_lookup+0x61/0xce()
[  288.525222] Hardware name: XPS M1530                       
[  288.525226] Modules linked in: fuse nouveau ttm drm_kms_helper drm agpgart i2c_algo_bit iwl3945 iwlcore mac80211 cfg80211 ohci1394 ieee1394 btusb bluetooth rfkill sdhci_pci sdhci mmc_core ricoh_mmc sky2 cpufreq_conservative acpi_cpufreq freq_table loop
[  288.525279] Pid: 2982, comm: hald-addon-stor Not tainted 2.6.31-rc8 #1
[  288.525284] Call Trace:
[  288.525296]  [<c0225a00>] warn_slowpath_common+0x65/0x7c
[  288.525305]  [<c03ada74>] ? cfq_cic_lookup+0x61/0xce
[  288.525313]  [<c0225a24>] warn_slowpath_null+0xd/0x10
[  288.525321]  [<c03ada74>] cfq_cic_lookup+0x61/0xce
[  288.525330]  [<c03adc27>] cfq_may_queue+0x1e/0xab
[  288.525338]  [<c039d892>] elv_may_queue+0x14/0x1b
[  288.525348]  [<c03a291f>] get_request+0x24/0x253
[  288.525356]  [<c03a300e>] get_request_wait+0x18/0x12f
[  288.525366]  [<c0239077>] ? hrtimer_wakeup+0x0/0x1c
[  288.525375]  [<c0236d61>] ? remove_wait_queue+0x31/0x36
[  288.525383]  [<c03a315c>] blk_get_request+0x37/0x58
[  288.525394]  [<c0428cd1>] scsi_execute+0x23/0x11b
[  288.525403]  [<c0428e1b>] scsi_execute_req+0x52/0x7f
[  288.525412]  [<c0433db0>] sr_test_unit_ready+0x41/0x99
[  288.525420]  [<c04347c8>] sr_media_change+0x37/0x1e3
[  288.525430]  [<c044d459>] media_changed+0x43/0x72
[  288.525438]  [<c044d4b0>] cdrom_media_changed+0x28/0x2e
[  288.525446]  [<c0433e4f>] sr_block_media_changed+0x11/0x13
[  288.525456]  [<c02b1e7d>] check_disk_change+0x19/0x42
[  288.525464]  [<c044f23f>] cdrom_open+0x794/0x7f7
[  288.525473]  [<c0423ba6>] ? scsi_done+0x0/0xd
[  288.525480]  [<c043f4d1>] ? atapi_xlat+0x0/0x15f
[  288.525491]  [<c05bfe39>] ? schedule_timeout+0x17/0xbd
[  288.525499]  [<c0423ba6>] ? scsi_done+0x0/0xd
[  288.525508]  [<c03b0907>] ? kobject_put+0x37/0x3c
[  288.525518]  [<c041bf8a>] ? put_device+0xf/0x11
[  288.525526]  [<c04280a2>] ? scsi_request_fn+0x326/0x3c6
[  288.525536]  [<c0380536>] ? avc_has_perm+0x3c/0x46
[  288.525545]  [<c0380536>] ? avc_has_perm+0x3c/0x46
[  288.525553]  [<c03b09e8>] ? kobject_get+0x12/0x17
[  288.525562]  [<c03a88cf>] ? get_disk+0x39/0x50
[  288.525571]  [<c03a88f0>] ? exact_lock+0xa/0x11
[  288.525579]  [<c03b09e8>] ? kobject_get+0x12/0x17
[  288.525587]  [<c041c01a>] ? get_device+0x13/0x18
[  288.525595]  [<c0433f8a>] sr_block_open+0x6d/0x82
[  288.525603]  [<c02b27a0>] __blkdev_get+0x7f/0x289
[  288.525612]  [<c02b29b4>] blkdev_get+0xa/0xc
[  288.525619]  [<c02b2a15>] blkdev_open+0x5f/0x8b
[  288.525629]  [<c0292e84>] __dentry_open+0x111/0x1f4
[  288.525638]  [<c0293005>] nameidata_to_filp+0x2d/0x42
[  288.525646]  [<c02b29b6>] ? blkdev_open+0x0/0x8b
[  288.525655]  [<c029cb6a>] do_filp_open+0x3b2/0x6a4
[  288.525663]  [<c044e0dd>] ? cdrom_release+0x16c/0x19f
[  288.525671]  [<c03b0907>] ? kobject_put+0x37/0x3c
[  288.525680]  [<c029069b>] ? __slab_alloc+0x9a/0x425
[  288.525689]  [<c0290af8>] ? kmem_cache_alloc+0x57/0xd9
[  288.525697]  [<c029cebb>] ? getname+0x1b/0xb9
[  288.525706]  [<c02a4787>] ? alloc_fd+0x53/0xb8
[  288.525715]  [<c0292c7d>] do_sys_open+0x48/0xdf
[  288.525724]  [<c0292d56>] sys_open+0x1e/0x26
[  288.525733]  [<c0202984>] sysenter_do_call+0x12/0x22
[  288.525739] ---[ end trace 1946ff4265fe09ba ]---
[  288.525759] ------------[ cut here ]------------
[  288.526043] kernel BUG at block/cfq-iosched.c:1775!
[  288.526104] invalid opcode: 0000 [#1] SMP

Comment 1 Xavier 2009-09-10 10:49:08 UTC

Created attachment 29389 [details]
xorg log

Comment 2 Xavier 2009-09-10 10:49:31 UTC

Created attachment 29390 [details]
kernel dmesg

Comment 3 Xavier 2009-09-10 10:49:50 UTC

Created attachment 29391 [details]
kernel config

Comment 4 Xavier 2009-09-10 13:22:46 UTC

After a git bisect and 10 reboot, I found the offending commit :
http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=9d8ae0c84726807c43b7230b8e6c2d079c33b81d

reverting that line works perfectly here :)

I just regret I first tried nouveau-git right after that commit went in. If I had tried just before, it would have been easy to find the regression.

Comment 5 Xavier 2009-09-10 13:45:43 UTC

sorry, disregard the last information.
The problem is that there is some randomness involved. The kernel booted fine twice in a row, and then everything worked fine. But rebooting another time triggered the same failure again.

Comment 6 Andreas Radke 2009-09-11 09:01:29 UTC

Same here. Same distribution. Different card. I have a gf6200TC. It happens to me only when I enable KMS. Then it took me 8 boots to get it up. 7times it crashed similar.

With KMS disabled this never happened and Xorg starts fine. Then there are problems shutting down Xorg process but that's probably something different.

How can we help to track this down?

stable kernel 2.6.31, nouveau from master daily snapshot 20090909. gcc4.4.1

Comment 7 Pekka Paalanen 2009-09-14 11:48:01 UTC

From what has been reported here, this looks like a race in a completely irrelevant (to nouveau) part of the kernel. It happens randomly, more often when nouveau.ko KMS is loaded early, and less often when nouveau.ko loads later. AFAIK no-one has yet managed to hit it completely without Nouveau, which is somewhat concerning. My guess is that nouveau alters the kernel code execution timings and makes a race elsewhere a lot more failure prone.

Some suggestions:
- search kernel mailing lists for CFQ bug reports (since it seems to happen in CFQ code)
- try to reproduce it without nouveau, e.g., try an old FB driver
- try enabling heavy debugging features in the kernel config, like kmemcheck or kmemleak

I'm trying to steer this away from Nouveau first, because it looks very difficult to debug.

In the future, please provide full kernel logs from boot, and set attachment type to text/plain.

Comment 8 Xavier 2009-09-15 14:13:39 UTC

Some new information :
* CONFIG_DEBUG_SPINLOCK (spinlock and rw lock debugging : basic checks) works around the problem (it leads to very stable nouveau kms)
* maxcpus=1 does not workaround the problem
* traces are not always in cfq code
* traces only happen when nouveau is loaded
* other framebuffer drivers work fine (vesafb, uvesafb, nvidiafb)
* crash are (almost?) always with kms on
* crash can happen both at boot or when reloading module
* crash seem more frequent when nouveau is loaded by udev or in early boot, rather than loaded with X
* happen both with 2.6.31-rc8 (nouveau-git) and 2.6.31 stable

Comment 9 Xavier 2009-09-15 14:15:31 UTC

Created attachment 29577 [details]
crash when reloading the module (2.6.31 stable / master-compat 20090909)

Comment 10 Xavier 2009-09-15 14:16:18 UTC

Created attachment 29578 [details]
crash when reloading the module (nouveau-git kernel from today - 2.6.31-rc8)

Comment 11 Xavier 2009-09-18 06:04:41 UTC

* building a kernel without highmem does not help
* booting with nosmp does not help

Comment 12 Xavier 2009-09-24 14:02:58 UTC

Disabling PAT on latest nouveau git kernel does not help.
2.6.31.1 kernel (+ master-compat snapshopt from today) does not help.

Comment 13 Xavier 2009-09-28 13:22:17 UTC

So still using nouveau kernel (2.6.31-rc9 now).

When nouveau is loaded during boot, it seems to crash 100% of the times, with a trace similar to the one here :
https://bugs.freedesktop.org/attachment.cgi?id=29578

That is :
BUG: unable to handle kernel NULL pointer dereference at 00000003
IP: [<c027e452>] kmem_cache_alloc+0x62/0xda

When I boot with noacpi, the first initialization does not crash for some reasons... But as soon as I reload the module, it crashes, in random places of the code. I will attach two dmesg showing that.

Comment 14 Xavier 2009-09-28 13:23:29 UTC

Created attachment 29924 [details]
crash when reloading the module (nouveau-git 2.6.31-rc9 booted with noacpi)

Comment 15 Xavier 2009-09-28 13:23:51 UTC

Created attachment 29925 [details]
another crash when reloading the module (nouveau-git 2.6.31-rc9 booted with noacpi)

Comment 16 Pekka Paalanen 2009-09-29 10:07:47 UTC

Created attachment 29949 [details] [review]
replacement for DEBUG_SPINLOCK?

Could you try if this patch has the same effect as enabling DEBUG_SPINLOCK?
If it does, could you try to reduce the patch to a minimal set of changes that does the same.

Comment 17 Andreas Radke 2009-09-29 10:30:50 UTC

Created attachment 29950 [details]
ttm trace log

Comment 18 Andreas Radke 2009-09-29 10:32:25 UTC

I have a x86_64 system and enabled DEBUG_SPINLOCK but it didn't solve anything for me.

I enabled the debug option for the drm module and when udev starts it shows 

Sep 29 16:59:20 workstation64 kernel: BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc
Sep 29 16:59:20 workstation64 kernel: IP: [<ffffffff811f726b>] _raw_spin_lock+0x2b/0x190
Sep 29 16:59:20 workstation64 kernel: PGD 22e59f067 PUD 22e60a067 PMD 0 
Sep 29 16:59:20 workstation64 kernel: Oops: 0000 [#1] PREEMPT SMP 
Sep 29 16:59:20 workstation64 kernel: last sysfs file: /sys/module/evdev/initstate
Sep 29 16:59:20 workstation64 kernel: CPU 3 
Sep 29 16:59:20 workstation64 kernel: Modules linked in: ttm(+) serio_raw drm_kms_helper thermal sg mii pcspkr soundcore snd_page_alloc evdev ehci_hcd(+) drm i2c_i801 button iTCO_wdt i2c_algo_bit iTCO_vendor_su
pport i2c_core processor usbcore intel_agp rtc_cmos rtc_core rtc_lib ext4 mbcache jbd2 crc16 sd_mod ahci libata scsi_mod
Sep 29 16:59:20 workstation64 kernel: Pid: 601, comm: modprobe Not tainted 2.6.31-ARCH #1 OEM
Sep 29 16:59:20 workstation64 kernel: RIP: 0010:[<ffffffff811f726b>]  [<ffffffff811f726b>] _raw_spin_lock+0x2b/0x190

more in the log

Comment 19 Tavian Barnes 2009-09-30 16:45:55 UTC

Created attachment 29962 [details] [review]
[PATCH] drm/nouveau: fix missized allocation for ttm_bo_global struct

This patch from http://0x04.net/~mwk/0001-drm-nouveau-fix-missized-allocation-for-ttm_bo_glob.patch fixes it, right?

Comment 20 Andreas Radke 2009-09-30 20:51:39 UTC

Yes, it fixes it for Xavier (i686) and me (x86_64).

Comment 21 Marcin Kościelnicki 2009-10-01 00:35:07 UTC

Seems to work for everyone involved, fix has been commited to git. Closing.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.