Bug 29048 - [965gm] BUG: unable to handle kernel NULL pointer dereference IP: __mutex_lock_slowpath
Summary: [965gm] BUG: unable to handle kernel NULL pointer dereference IP: __mutex_loc...
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: General (show other bugs)
Version: XOrg git
Hardware: x86 (IA32) Linux (All)
: medium normal
Assignee: Dave Airlie
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-07-13 13:03 UTC by Geir Ove Myhr
Modified: 2011-02-21 04:34 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
dmesg with drm.debug=0x04 (79.33 KB, text/plain)
2010-07-13 13:09 UTC, Geir Ove Myhr
no flags Details
dmesg with drm.debug=0x04 when it boots fine, for comparison (73.54 KB, text/plain)
2010-07-13 13:13 UTC, Geir Ove Myhr
no flags Details
Xorg.5.log when bug happens (10.75 KB, text/x-log)
2010-07-13 13:18 UTC, Geir Ove Myhr
no flags Details
Xorg.0.log when there is no problem (29.20 KB, text/x-log)
2010-07-13 13:20 UTC, Geir Ove Myhr
no flags Details
dmesg with 2.6.34-rc4 mainline build (60.02 KB, text/plain)
2010-07-13 14:23 UTC, Geir Ove Myhr
no flags Details
dmesg from 2.6.35-rc5 (58.79 KB, text/plain)
2010-07-14 00:59 UTC, Geir Ove Myhr
no flags Details
dmesg output with Ubuntu kernel 2.6.35-6.8 (based on -rc3) (62.16 KB, text/plain)
2010-07-16 00:29 UTC, Geir Ove Myhr
no flags Details

Description Geir Ove Myhr 2010-07-13 13:03:04 UTC
I hope the title pasted from dmesg output makes sense to you guys.

Forwarding my own Ubuntu bug report:
https://bugs.launchpad.net/ubuntu/+source/xserver-xorg-video-intel/+bug/605124

On current Ubuntu 10.10 development edition with kernel based on 2.6.35-rc4 I sometimes (i.e. between 20% and 50% of the times) get this error in dmesg at boot:

[ 11.287021] BUG: unable to handle kernel NULL pointer dereference at (null)
[ 11.287368] IP: [<c05cb14a>] __mutex_lock_slowpath+0x9a/0x140
[ 11.287647] *pde = 7ce57067
[ 11.287790] Oops: 0002 [#1] SMP
[ 11.287957] last sysfs file: /sys/module/snd_rawmidi/initstate
[ 11.288226] Modules linked in: snd_rawmidi(+) i915(+) snd_seq_midi_event snd_seq drm_kms_helper iwlcore snd_timer mac80211 snd_seq_device drm snd yenta_socket pcmcia_rsrc intel_agp cfg80211 psmouse tpm_tis serio_raw pcmcia_core nvram i2c_algo_bit video tpm tpm_bios soundcore snd_page_alloc agpgart output lp parport usbhid hid firewire_ohci sdhci_pci firewire_core ahci sdhci led_class crc_itu_t e1000e libahci
[ 11.289012]
[ 11.289012] Pid: 370, comm: plymouthd Not tainted 2.6.35-7-generic #12-Ubuntu 7764CTO/7764CTO
[ 11.289012] EIP: 0060:[<c05cb14a>] EFLAGS: 00010246 CPU: 1
[ 11.289012] EIP is at __mutex_lock_slowpath+0x9a/0x140
[ 11.289012] EAX: 00000000 EBX: f40aaa90 ECX: f40aaa98 EDX: f6a77ec0
[ 11.289012] ESI: f40aaa94 EDI: f40aaa90 EBP: f6a77ed8 ESP: f6a77eb4
[ 11.289012] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 11.289012] Process plymouthd (pid: 370, ti=f6a76000 task=f70fe580 task.ti=f6a76000)
[ 11.289012] Stack:
[ 11.289012] f6a77ec4 f70fe580 f40aaa98 f40aaa98 00000000 f6a77ed8 f40aaa90 f45be700
[ 11.289012] <0> f40aaa90 f6a77ee8 c05cb095 f45be700 f45be700 f6a77efc f8393dc3 f45be700
[ 11.289012] <0> f45be760 f40aa800 f6a77f44 f838b337 00000001 f839be25 f839b220 f839d590
[ 11.289012] Call Trace:
[ 11.289012] [<c05cb095>] ? mutex_lock+0x25/0x40
[ 11.289012] [<f8393dc3>] ? drm_fb_release+0x23/0x70 [drm]
[ 11.289012] [<f838b337>] ? drm_release+0x3e7/0x4c0 [drm]
[ 11.289012] [<c021a13f>] ? __fput+0xdf/0x1f0
[ 11.289012] [<c021a26d>] ? fput+0x1d/0x30
[ 11.289012] [<c0216b5c>] ? filp_close+0x4c/0x80
[ 11.289012] [<c0216c05>] ? sys_close+0x75/0xc0
[ 11.289012] [<c05cc7dc>] ? syscall_call+0x7/0xb
[ 11.289012] Code: 83 79 18 63 7f b3 8d 76 00 8d 73 04 89 f0 e8 5e 11 00 00 8d 53 08 89 55 e4 8d 55 e8 8b 43 0c 89 53 0c 8b 4d e4 89 45 ec 89 4d e8 <89> 10 8b 45 e0 ba ff ff ff ff 89 45 f0 89 d0 87 03 83 f8 01 74
[ 11.289012] EIP: [<c05cb14a>] __mutex_lock_slowpath+0x9a/0x140 SS:ESP 0068:f6a77eb4
[ 11.289012] CR2: 0000000000000000
[ 11.383770] ---[ end trace b880c9a6dbf71a43 ]---

/var/log/gdm/:0.log then says:

(EE) intel(0): [drm] failed to set drm interface version.
(EE) intel(0): Failed to become DRM master.
[...]
DRM_IOCTL_I915_GEM_APERTURE failed: Bad file descriptor
Assuming 131072kB available aperture size.
May lead to reduced performance or incorrect rendering.
get chip id failed: -1 [9]
param: 4, val: 0
(EE) intel(0): failed to get resources: Bad file descriptor
(EE) intel(0): Kernel modesetting setup failed
(EE) Screen(s) found, but none have a usable configuration.

Fatal server error:
no screens found

After 4-5 attempts to start gdm, Ubuntu's failsafeX starts up.

System information:

DistroRelease: Ubuntu 10.10
Package: xorg 1:7.5+6ubuntu2
ProcVersionSignature: Ubuntu 2.6.35-7.12-generic 2.6.35-rc4
Uname: Linux 2.6.35-7-generic i686
Architecture: i386
InstallationMedia: Ubuntu 10.10 "Maverick Meerkat" - Alpha i386 (20100630.2)
MachineType: LENOVO 7764CTO
ProcCmdLine: BOOT_IMAGE=/boot/vmlinuz-2.6.35-7-generic root=UUID=460f28a3-969c-409a-ab57-6070e2c7288e ro quiet splash drm.debug=0x04
dmi.product.version: ThinkPad X61 Tablet
dmi.sys.vendor: LENOVO
system:
 distro: Ubuntu
 codename: maverick
 architecture: i686
 kernel: 2.6.35-7-generic
Comment 1 Geir Ove Myhr 2010-07-13 13:09:01 UTC
Created attachment 36998 [details]
dmesg with drm.debug=0x04

I added drm.debug=0x04, since Xorg.X.log complained about 
(EE) intel(0): Kernel modesetting setup failed, 
but I realize it may actually just add noise in this case.
Comment 2 Jesse Barnes 2010-07-13 13:11:42 UTC
Presumably this used to work, can you bisect the kernel to find where it broke?
Comment 3 Geir Ove Myhr 2010-07-13 13:13:29 UTC
Created attachment 36999 [details]
dmesg with drm.debug=0x04 when it boots fine, for comparison
Comment 4 Geir Ove Myhr 2010-07-13 13:18:27 UTC
Created attachment 37000 [details]
Xorg.5.log when bug happens
Comment 5 Jesse Barnes 2010-07-13 13:20:35 UTC
Looks like it fails when i915 loads:

[   11.262370] i915 0000:00:02.0: power state changed by ACPI to D0
[   11.262427] i915 0000:00:02.0: power state changed by ACPI to D0
[   11.262437] i915 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[   11.262443] i915 0000:00:02.0: setting latency timer to 64
[   11.287021] BUG: unable to handle kernel NULL pointer dereference at (null)
[   11.287368] IP: [<c05cb14a>] __mutex_lock_slowpath+0x9a/0x140
[   11.287647] *pde = 7ce57067 
[   11.287790] Oops: 0002 [#1] SMP 
[   11.287957] last sysfs file: /sys/module/snd_rawmidi/initstate
[   11.288226] Modules linked in: snd_rawmidi(+) i915(+) snd_seq_midi_event snd_seq drm_kms_helper iwlcore snd_timer mac80211 snd_seq_device drm snd yenta_socket pcmcia_rsrc intel_agp cfg80211 psmouse tpm_tis serio_raw pcmcia_core nvram i2c_algo_bit video tpm tpm_bios soundcore snd_page_alloc agpgart output lp parport usbhid hid firewire_ohci sdhci_pci firewire_core ahci sdhci led_class crc_itu_t e1000e libahci
[   11.289012] 
[   11.289012] Pid: 370, comm: plymouthd Not tainted 2.6.35-7-generic #12-Ubuntu 7764CTO/7764CTO
[   11.289012] EIP: 0060:[<c05cb14a>] EFLAGS: 00010246 CPU: 1
[   11.289012] EIP is at __mutex_lock_slowpath+0x9a/0x140
[   11.289012] EAX: 00000000 EBX: f40aaa90 ECX: f40aaa98 EDX: f6a77ec0
[   11.289012] ESI: f40aaa94 EDI: f40aaa90 EBP: f6a77ed8 ESP: f6a77eb4
[   11.289012]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[   11.289012] Process plymouthd (pid: 370, ti=f6a76000 task=f70fe580 task.ti=f6a76000)
[   11.289012] Stack:
[   11.289012]  f6a77ec4 f70fe580 f40aaa98 f40aaa98 00000000 f6a77ed8 f40aaa90 f45be700
[   11.289012] <0> f40aaa90 f6a77ee8 c05cb095 f45be700 f45be700 f6a77efc f8393dc3 f45be700
[   11.289012] <0> f45be760 f40aa800 f6a77f44 f838b337 00000001 f839be25 f839b220 f839d590
[   11.289012] Call Trace:
[   11.289012]  [<c05cb095>] ? mutex_lock+0x25/0x40
[   11.289012]  [<f8393dc3>] ? drm_fb_release+0x23/0x70 [drm]
[   11.289012]  [<f838b337>] ? drm_release+0x3e7/0x4c0 [drm]
[   11.289012]  [<c021a13f>] ? __fput+0xdf/0x1f0
[   11.289012]  [<c021a26d>] ? fput+0x1d/0x30
[   11.289012]  [<c0216b5c>] ? filp_close+0x4c/0x80
[   11.289012]  [<c0216c05>] ? sys_close+0x75/0xc0
[   11.289012]  [<c05cc7dc>] ? syscall_call+0x7/0xb
[   11.289012] Code: 83 79 18 63 7f b3 8d 76 00 8d 73 04 89 f0 e8 5e 11 00 00 8d 53 08 89 55 e4 8d 55 e8 8b 43 0c 89 53 0c 8b 4d e4 89 45 ec 89 4d e8 <89> 10 8b 45 e0 ba ff ff ff ff 89 45 f0 89 d0 87 03 83 f8 01 74 
[   11.289012] EIP: [<c05cb14a>] __mutex_lock_slowpath+0x9a/0x140 SS:ESP 0068:f6a77eb4
[   11.289012] CR2: 0000000000000000
[   11.383770] ---[ end trace b880c9a6dbf71a43 ]---
[   11.399765]   alloc irq_desc for 44 on node -1
[   11.400053]   alloc kstat_irqs on node -1
[   11.401879] i915 0000:00:02.0: irq 44 for MSI/MSI-X
[   11.403369] [drm] set up 7M of stolen space

iirc there were some changes to fb teardown recently, checking now.
Comment 6 Geir Ove Myhr 2010-07-13 13:20:35 UTC
Created attachment 37001 [details]
Xorg.0.log when there is no problem
Comment 7 Geir Ove Myhr 2010-07-13 14:23:04 UTC
Created attachment 37003 [details]
dmesg with 2.6.34-rc4 mainline build

(In reply to comment #6)
> Looks like it fails when i915 loads:
> 
> [   11.262370] i915 0000:00:02.0: power state changed by ACPI to D0
> [   11.262427] i915 0000:00:02.0: power state changed by ACPI to D0
> [   11.262437] i915 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
> [   11.262443] i915 0000:00:02.0: setting latency timer to 64
> [   11.287021] BUG: unable to handle kernel NULL pointer dereference at (null)

If you mean that the BUG arrives right after the i915 lines, that may be a coincidence. Running with Ubuntu's mainline build of 2.6.35-rc4 I get 

[   13.106855] i915 0000:00:02.0: power state changed by ACPI to D0
[   13.106912] i915 0000:00:02.0: power state changed by ACPI to D0
[   13.106922] i915 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[   13.106928] i915 0000:00:02.0: setting latency timer to 64
[   13.135340] e1000e 0000:00:19.0: irq 42 for MSI/MSI-X
[   13.135722] ADDRCONF(NETDEV_UP): eth0: link is not ready
[...]
[   14.464944] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None
[   14.464949] e1000e 0000:00:19.0: eth0: 10/100 speed: disabling TSO
[   14.465228] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   14.505338] plymouthd:329 freeing invalid memtype e012c000-e0258000
[   14.508271] BUG: unable to handle kernel NULL pointer dereference at (null)
[   14.508279] IP: [<c05ca358>] __mutex_lock_slowpath+0x88/0x120
[...]

The funny thing is that with 2.6.35-rc4 and -rc5 mainline builds, xorg works just fine, no messages in Xorg.0.log, etc, so I thought it was Ubuntu specific. But the same thing appears in dmesg on some boots.

Will try older mainline builds, and then git-bisect after than if necessary.
Comment 8 Jesse Barnes 2010-07-13 14:37:40 UTC
On Tue, 13 Jul 2010 14:23:04 -0700 (PDT)
bugzilla-daemon@freedesktop.org wrote:

> https://bugs.freedesktop.org/show_bug.cgi?id=29048
> 
> --- Comment #7 from Geir Ove Myhr <gomyhr@gmail.com> 2010-07-13 14:23:04 PDT ---
> Created an attachment (id=37003)
>  --> (https://bugs.freedesktop.org/attachment.cgi?id=37003)
> dmesg with 2.6.34-rc4 mainline build
> 
> (In reply to comment #6)
> > Looks like it fails when i915 loads:
> > 
> > [   11.262370] i915 0000:00:02.0: power state changed by ACPI to D0
> > [   11.262427] i915 0000:00:02.0: power state changed by ACPI to D0
> > [   11.262437] i915 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
> > [   11.262443] i915 0000:00:02.0: setting latency timer to 64
> > [   11.287021] BUG: unable to handle kernel NULL pointer dereference at (null)
> 
> If you mean that the BUG arrives right after the i915 lines, that may be a
> coincidence. Running with Ubuntu's mainline build of 2.6.35-rc4 I get 
> 
> [   13.106855] i915 0000:00:02.0: power state changed by ACPI to D0
> [   13.106912] i915 0000:00:02.0: power state changed by ACPI to D0
> [   13.106922] i915 0000:00:02.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
> [   13.106928] i915 0000:00:02.0: setting latency timer to 64
> [   13.135340] e1000e 0000:00:19.0: irq 42 for MSI/MSI-X
> [   13.135722] ADDRCONF(NETDEV_UP): eth0: link is not ready
> [...]
> [   14.464944] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control:
> None
> [   14.464949] e1000e 0000:00:19.0: eth0: 10/100 speed: disabling TSO
> [   14.465228] ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [   14.505338] plymouthd:329 freeing invalid memtype e012c000-e0258000
> [   14.508271] BUG: unable to handle kernel NULL pointer dereference at (null)
> [   14.508279] IP: [<c05ca358>] __mutex_lock_slowpath+0x88/0x120
> [...]
> 
> The funny thing is that with 2.6.35-rc4 and -rc5 mainline builds, xorg works
> just fine, no messages in Xorg.0.log, etc, so I thought it was Ubuntu specific.
> But the same thing appears in dmesg on some boots.
> 
> Will try older mainline builds, and then git-bisect after than if necessary.

It looks like plymouth is accessing the DRI device before i915 has
finished loading...
Comment 9 Geir Ove Myhr 2010-07-13 14:41:13 UTC
Forgot to add these package verisons. Fairly new, but not latest and greatest:

xserver-xorg 1:7.5+6ubuntu2
libgl1-mesa-glx 7.8.1-3ubuntu2
libdrm2 2.4.20-2ubuntu1
xserver-xorg-video-intel 2:2.11.0-1ubuntu2
Comment 10 Geir Ove Myhr 2010-07-14 00:59:57 UTC
Created attachment 37011 [details]
dmesg from 2.6.35-rc5

I had problems reproducing the bug after a ureadahead reprofiling didn't work. After rebooting with the Ubuntu kernel which is apparently set up to work with ureadahead, I can now reproduce it again. I attach another dmesg output, since the signature is a bit different this time (but is still the same bug I guess?)

[   16.489542] BUG: unable to handle kernel NULL pointer dereference at 00000008
[   16.489551] IP: [<c0380aa5>] fb_release+0x25/0x60

I won't pretend that I understand these things...

Now that I can reproduce the bug again, I will go back to testing older kernels (but it is time consuming, since it doesn't happen on every boot).
Comment 11 Geir Ove Myhr 2010-07-14 01:06:29 UTC
(In reply to comment #8)
> It looks like plymouth is accessing the DRI device before i915 has
> finished loading...

Is this to say that this probably is plymouth's fault and not the i915 driver, or is it an observation that the i915 driver probably lets plymouth access the DRI device too early? Or is it Ubuntu's fault for setting up the plymouth to load too early?
Comment 12 Jesse Barnes 2010-07-14 08:27:35 UTC
It's an i915/drm problem; even if plymouth opens the device early, it shouldn't cause an oops.  We need better locking or state checking around the drm open call.
Comment 13 Geir Ove Myhr 2010-07-16 00:29:18 UTC
Created attachment 37109 [details]
dmesg output with Ubuntu kernel 2.6.35-6.8 (based on -rc3)

Until now I hadn't seen this problem on -rc3 based kernels, but today I got the IP: fb_release oops with Ubuntu kernel 2.6.35-6.8 (based on -rc3). 

Unlike the -rc4 based Ubuntu kernel this one does not seem to cause any problems for xorg or anything else so far.

I was just about to start testing kernels between -rc3 and -rc4 (there were a bunch of drm/i915 commits there), but I suppose that is useless now.
Comment 14 Geir Ove Myhr 2010-07-19 02:20:43 UTC
Based on a comment from Robert Hooker downstream, I did some testing on the newest -rc5 based Ubuntu kernel (2.6.35-8.13). With this kernel, xorg does not fail, but I still get the kernel oops. 
* Using the (invalid) kernel parameter vesafb=fubar did not change anything.
* With the Grub parameter 'set gfxpayload=text' instead of 'set gfxpayload=keep', I was not able to reproduce the oops.

This is apparently a recent change of defaults in Ubuntu [1]. There are some more comments on this issue on an Ubuntu bug report [2].

[1]: https://lists.ubuntu.com/archives/ubuntu-devel/2010-July/030995.html
[2]: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/606244
Comment 15 Jesse Barnes 2010-07-19 12:48:31 UTC
Looks like you get to choose between an fb_release failure from core DRM or from vesafb...

Dave, can you take a look?  Maybe drm_release needs to check whether KMS has been initialized before calling drm_fb_release...
Comment 16 Geir Ove Myhr 2010-08-29 12:27:32 UTC
This problem has now disappeared. I'm not sure if this is because the bug has been fixed or if other components changed so that the timing is now different.
Comment 17 Timo Aaltonen 2011-02-21 04:34:43 UTC
the downstream bug was closed, so closing this as well


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.