Bug 30645

Summary: Kernel NULL pointer crash when viewing big images in Firefox on Radeon XPress 200M
Product: DRI Reporter: Matthijs Kooijman <matthijs>
Component: DRM/RadeonAssignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: lisaev, pumba88
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
Xorg.log
none
Dmesg log and patch for a BUG() in ttm_bo_vm_insert_rb
none
Kernel log messages from a normal system boot none

Description Matthijs Kooijman 2010-10-06 03:08:00 UTC
Created attachment 39215 [details] [review]
 Xorg.log

Since a while, I've been suffering kernel NULL pointer crashes, resulting in a locked up console (e.g., the X display freezes and input devices become unresponsive). The machine is usually still up and running otherwise, I can access it through SSH still (though at least once networking seems to have broken as well).

This issue occurs deterministically when viewing certain images in Firefox. So far I've found two images that trigger this bug, both of which are very big thousands of pixels both ways). I suspect that image size is the only criterium here, but I haven't done structured testing to verify that.

The bug happens as soon as a big part of the image has been loaded (or perhaps when loading is complete, not sure about that). When I press Firefox' stop button halfway through loading the image, nothing breaks and I can view the partial image without problems.

I'm not 100% sure if this is a radeon driver bug, this might be kernel related as well. I'm using 2.6.35.2, self-compiled with KMS enabled. Radeon driver is version 1:6.13.1-2 from Debian.

One example of such a big image that breaks for me is: http://xkcd.com/802_large/

I've attached my Xorg.log, though I don't think it shows any entries relevant to the crash itself.

The crash log from dmesg is as follows:
[140972.373244] [TTM] Failed to find memory space for buffer 0xffff88000bc1f048 eviction.
[140972.373251] [TTM] No space for ffff88000bc1f048 (10713 pages, 42852K, 41M)
[140972.373255] [TTM]   placement[0]=0x00070002 (1)
[140972.373257] [TTM]     has_type: 1
[140972.373259] [TTM]     use_type: 1
[140972.373260] [TTM]     flags: 0x0000000A
[140972.373262] [TTM]     gpu_offset: 0x60000000
[140972.373264] [TTM]     size: 8192
[140972.373266] [TTM]     available_caching: 0x00070000
[140972.373269] [TTM]     default_caching: 0x00010000
[140972.373272] [TTM]  0x00000000-0x00000100:      256: used
[140972.373276] [TTM]  0x00000100-0x00000101:        1: used
[140972.373278] [TTM]  0x00000101-0x00000201:      256: used
[140972.373282] [TTM]  0x00000201-0x00002000:     7679: free
[140972.373284] [TTM]  total: 8192, used 513 free 7679
[140972.373293] [TTM] Failed to find memory space for buffer 0xffff88000bc1f048 eviction.
[140972.373296] [TTM] No space for ffff88000bc1f048 (10713 pages, 42852K, 41M)
[140972.373299] [TTM]   placement[0]=0x00070002 (1)
[140972.373301] [TTM]     has_type: 1
[140972.373303] [TTM]     use_type: 1
[140972.373304] [TTM]     flags: 0x0000000A
[140972.373306] [TTM]     gpu_offset: 0x60000000
[140972.373308] [TTM]     size: 8192
[140972.373310] [TTM]     available_caching: 0x00070000
[140972.373312] [TTM]     default_caching: 0x00010000
[140972.373315] [TTM]  0x00000000-0x00000100:      256: used
[140972.373318] [TTM]  0x00000100-0x00000101:        1: used
[140972.373321] [TTM]  0x00000101-0x00000201:      256: used
[140972.373324] [TTM]  0x00000201-0x00002000:     7679: free
[140972.373327] [TTM]  total: 8192, used 513 free 7679
[140972.373349] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[140972.373353] IP: [<ffffffff8123ece8>] radeon_ttm_bo_destroy+0x38/0xc0
[140972.373363] PGD 5857d067 PUD 5934d067 PMD 0 
[140972.373367] Oops: 0002 [#1] 
[140972.373370] last sysfs file: /sys/devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:04/PNP0C09:00/PNP0C0A:00/power_supply/BAT1/charge_full
[140972.373374] CPU 0 
[140972.373376] Modules linked in: rt2500pci rt2x00pci rt2x00lib mac80211 loop usblp pl2303 usbserial usb_storage cpufreq_ondemand 8139too mii af_packet cpufreq_powersave vboxnetadp vboxnetflt vboxdrv sco bnep rfcomm l2cap crc16 ipv6 fuse dm_snapshot snd_atiixp snd_atiixp_modem snd_ac97_codec snd_usb_audio ac97_bus snd_pcm_oss snd_mixer_oss snd_pcm snd_hwdep snd_usbmidi_lib snd_seq_midi snd_seq_midi_event pcmcia snd_seq snd_rawmidi btusb snd_timer snd_seq_device bluetooth wacom yenta_socket cfg80211 snd pcmcia_rsrc msi_laptop rtc_cmos eeprom_93cx6 rfkill pcmcia_core psmouse rtc_core serio_raw soundcore snd_page_alloc rtc_lib evdev usbhid hid sdhci_pci sg sdhci ohci_hcd ehci_hcd sr_mod usbcore mmc_core sd_mod cdrom unix [last unloaded: mac80211]
[140972.373428] 
[140972.373432] Pid: 16612, comm: Xorg Tainted: G       A    2.6.35.2 #1 /MS-1013
[140972.373435] RIP: 0010:[<ffffffff8123ece8>]  [<ffffffff8123ece8>] radeon_ttm_bo_destroy+0x38/0xc0
[140972.373441] RSP: 0018:ffff880058565b08  EFLAGS: 00010296
[140972.373444] RAX: ffff88000590f600 RBX: ffff88000590f600 RCX: 0000000000000000
[140972.373447] RDX: 0000000000000000 RSI: ffffffff8120a110 RDI: ffff88005b9a4de8
[140972.373450] RBP: ffff880058565b28 R08: 0000000000000000 R09: ffff88000bfaf6c0
[140972.373453] R10: 0000000000000006 R11: 0000000000000001 R12: ffff88000590f648
[140972.373456] R13: ffff88005b9a44e8 R14: ffff88005b9a4850 R15: 00000000029d9000
[140972.373460] FS:  00007f4429211700(0000) GS:ffffffff8161b000(0000) knlGS:00000000f6227920
[140972.373463] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[140972.373466] CR2: 0000000000000008 CR3: 000000005a34a000 CR4: 00000000000006f0
[140972.373469] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[140972.373472] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[140972.373475] Process Xorg (pid: 16612, threadinfo ffff880058564000, task ffff8800133bf840)
[140972.373478] Stack:
[140972.373480]  0000000000000001 ffff88000590f648 ffff88000590f68c ffff88005b9a44e8
[140972.373484] <0> ffff880058565b58 ffffffff8120a1a9 ffff880058565b58 ffff88000590f68c
[140972.373488] <0> ffffffff8120a110 0000000000000001 ffff880058565b78 ffffffff81165d16
[140972.373493] Call Trace:
[140972.373501]  [<ffffffff8120a1a9>] ttm_bo_release_list+0x99/0xf0
[140972.373506]  [<ffffffff8120a110>] ? ttm_bo_release_list+0x0/0xf0
[140972.373511]  [<ffffffff81165d16>] kref_put+0x36/0x70
[140972.373516]  [<ffffffff8120c810>] ? ttm_bo_release+0x0/0x70
[140972.373520]  [<ffffffff8120c86f>] ttm_bo_release+0x5f/0x70
[140972.373524]  [<ffffffff81165d16>] kref_put+0x36/0x70
[140972.373528]  [<ffffffff81209efe>] ttm_bo_unref+0x1e/0x20
[140972.373533]  [<ffffffff8120c698>] ttm_bo_init+0x2e8/0x330
[140972.373537]  [<ffffffff8123ebca>] radeon_bo_create+0x12a/0x210
[140972.373542]  [<ffffffff8123ecb0>] ? radeon_ttm_bo_destroy+0x0/0xc0
[140972.373547]  [<ffffffff81252853>] radeon_gem_object_create+0x83/0xf0
[140972.373552]  [<ffffffff8125290f>] radeon_gem_create_ioctl+0x4f/0xd0
[140972.373557]  [<ffffffff811f5d02>] drm_ioctl+0x2c2/0x430
[140972.373561]  [<ffffffff812528c0>] ? radeon_gem_create_ioctl+0x0/0xd0
[140972.373567]  [<ffffffff810bb618>] vfs_ioctl+0x18/0x70
[140972.373572]  [<ffffffff810bb7bd>] do_vfs_ioctl+0x7d/0x550
[140972.373577]  [<ffffffff81048723>] ? hrtimer_start+0x13/0x20
[140972.373583]  [<ffffffff810ae7de>] ? vfs_read+0xfe/0x160
[140972.373587]  [<ffffffff810bbcda>] sys_ioctl+0x4a/0x80
[140972.373593]  [<ffffffff81002ba8>] system_call_fastpath+0x16/0x1b
[140972.373595] Code: 8d 5f b8 4c 89 65 f0 4c 89 6d f8 49 89 fc 48 8b bb d0 01 00 00 48 81 c7 e8 0d 00 00 e8 c2 64 14 00 49 8b 54 24 b8 49 8b 44 24 c0 <48> 89 42 08 48 89 10 49 89 5c 24 b8 49 89 5c 24 c0 48 8b bb d0 
[140972.373624] RIP  [<ffffffff8123ece8>] radeon_ttm_bo_destroy+0x38/0xc0
[140972.373629]  RSP <ffff880058565b08>
[140972.373630] CR2: 0000000000000008
[140972.373634] ---[ end trace cfaa865bb3d2703d ]---
Comment 1 Alex Deucher 2010-10-06 07:40:36 UTC
Can attach your dmesg output?
Comment 2 Jerome Glisse 2010-10-11 08:59:48 UTC
Looking at code i am puzzle on how this could happen, segfault happen because bo->list.next is NULL which is impossible as we init list before calling ttm and every time we manipulate this list it's with list_del_init
Comment 3 Matthijs Kooijman 2010-10-13 01:32:52 UTC
Thanks for your comments, they allowed me to look around the code a bit and add some debugging instrumentation. I still don't understand the code or your comments completely though.

Regarding your comments: How are you so sure the problem occurs because bo->list.next is NULL? Couldn't it just as well be bo->list.prev, which is also dereferenced in list_del_init? Also, couldn't it be other values, like bo itself, or bo->rdev, etc. ?

I've tried to add some debugging output to my kernel to find out what codepaths are taken exactly and confirm that it is indeed those list pointers causing the problem. However, I've not been able to reproduce the exact same problem anymore so far. Out of four tries, I've completely locked up the machine three times, not allowing me to get at my debug output through SSH. The fourth time the machine was still responsive, but the crash was different. Instead of a NULL pointer, it encountered a BUG() in ttm_bo_vm_insert_rb (at ttm_bo.c:1614, though your line numbers might be slightly different due to my patches).

I'm attaching the dmesg output of this crash, which also includes the patch I applied at the bottom. Note that the "radeon_ttm_bo_destroy: Calling list_del_init with bo->list.prev: %p and bo->list.next: %p\n" message is useless, since I forget to actually pass in those arguments to printk.


I suspect that both of these are symptoms of the same underlying problem, perhaps this helps to find that problem. If you have thoughts about possible causes, please think aloud, I might be able to confirm or disprove any suspicions with some instrumentation.
Comment 4 Matthijs Kooijman 2010-10-13 01:36:02 UTC
Created attachment 39406 [details]
Dmesg log and patch for a BUG() in ttm_bo_vm_insert_rb
Comment 5 Matthijs Kooijman 2010-10-13 01:40:10 UTC
Created attachment 39408 [details]
Kernel log messages from a normal system boot

Here's the dmesg output (or actually, stuff from /var/log/kern.log) from a normal system boot. Is this what you needed?
Comment 6 Lollerke 2010-10-17 11:01:08 UTC
I have the same problem with Xpress 200M and Ubuntu 10.10 32 bit (xserver-xorg-video ati 6.13.1-1ubuntu5, linux-image-2.6.35-22.34 (2.6.35.4), xorg-server 1.9.0-0ubuntu7). Freeze occurs when I view specific pictures in Firefox.

"Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.831735] BUG: unable to handle kernel NULL pointer dereference at 00000004
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.831756] IP: [<f84832f1>] radeon_ttm_bo_destroy+0x31/0xa0 [radeon]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.831818] *pde = 56116067 
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.831826] Oops: 0002 [#1] SMP 
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.831836] last sysfs file: /sys/devices/system/cpu/cpu1/cpufreq/scaling_setspeed
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.831845] Modules linked in: binfmt_misc parport_pc ppdev arc4 joydev snd_hda_codec_realtek pcmcia snd_hda_intel snd_hda_codec snd_hwdep snd_pcm snd_seq_midi snd_rawmidi radeon snd_seq_midi_event snd_seq ath5k snd_timer snd_seq_device mac80211 ttm drm_kms_helper drm ath yenta_socket i2c_algo_bit pcmcia_rsrc snd ati_agp psmouse cfg80211 soundcore snd_page_alloc shpchp pcmcia_core serio_raw i2c_piix4 led_class agpgart video output lp parport 8139too usbhid 8139cp hid mii sata_sil pata_atiixp
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.831966] 
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.831976] Pid: 1006, comm: Xorg Not tainted 2.6.35-22-generic #34-Ubuntu Satellite L30/Satellite L30
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.831986] EIP: 0060:[<f84832f1>] EFLAGS: 00213286 CPU: 0
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832027] EIP is at radeon_ttm_bo_destroy+0x31/0xa0 [radeon]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832035] EAX: f32b8800 EBX: f32b8800 ECX: 00000000 EDX: 00000000
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832042] ESI: f32b8850 EDI: f400c404 EBP: f42e9d3c ESP: f42e9d30
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832050]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832058] Process Xorg (pid: 1006, ti=f42e8000 task=f322cc20 task.ti=f42e8000)
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832064] Stack:
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832069]  f32b8850 f32b8878 f400c404 f42e9d50 f8322051 f32b8878 f8321fe0 f400c414
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832088] <0> f42e9d60 c0352edd f32b8850 f400c404 f42e9d74 f832477c f32b8874 f8324710
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832109] <0> 00000001 f42e9d84 c0352edd f400c404 f32b8850 f42e9d94 f83225f2 fffffff4
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832132] Call Trace:
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832158]  [<f8322051>] ? ttm_bo_release_list+0x71/0xb0 [ttm]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832174]  [<f8321fe0>] ? ttm_bo_release_list+0x0/0xb0 [ttm]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832191]  [<c0352edd>] ? kref_put+0x2d/0x60
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832207]  [<f832477c>] ? ttm_bo_release+0x6c/0x90 [ttm]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832223]  [<f8324710>] ? ttm_bo_release+0x0/0x90 [ttm]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832233]  [<c0352edd>] ? kref_put+0x2d/0x60
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832249]  [<f83225f2>] ? ttm_bo_unref+0x32/0x50 [ttm]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832266]  [<f83246a2>] ? ttm_bo_init+0x1a2/0x1f0 [ttm]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832309]  [<f8483521>] ? radeon_bo_create+0x111/0x240 [radeon]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832355]  [<f84832c0>] ? radeon_ttm_bo_destroy+0x0/0xa0 [radeon]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832408]  [<f8496d36>] ? radeon_gem_object_create+0x76/0xe0 [radeon]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832459]  [<f8496e07>] ? radeon_gem_create_ioctl+0x67/0xe0 [radeon]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832474]  [<c01c645d>] ? t_start+0x8d/0x90
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832511]  [<f81fd98d>] ? drm_ioctl+0x1ad/0x430 [drm]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832524]  [<c01c645d>] ? t_start+0x8d/0x90
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832571]  [<f8496da0>] ? radeon_gem_create_ioctl+0x0/0xe0 [radeon]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832590]  [<c032e636>] ? apparmor_file_permission+0x16/0x20
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832605]  [<c0302a84>] ? security_file_permission+0x14/0x20
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832620]  [<c02185d2>] ? rw_verify_area+0x62/0xd0
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832630]  [<c01c645d>] ? t_start+0x8d/0x90
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832640]  [<c0226622>] ? vfs_ioctl+0x32/0xb0
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832667]  [<f81fd7e0>] ? drm_ioctl+0x0/0x430 [drm]
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832676]  [<c0226eb9>] ? do_vfs_ioctl+0x79/0x2d0
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832685]  [<c0227177>] ? sys_ioctl+0x67/0x80
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832695]  [<c01c645d>] ? t_start+0x8d/0x90
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832707]  [<c05c9114>] ? syscall_call+0x7/0xb
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832717]  [<c01c645d>] ? t_start+0x8d/0x90
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832726]  [<c01c645d>] ? t_start+0x8d/0x90
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832732] Code: 89 1c 24 89 74 24 04 89 7c 24 08 0f 1f 44 00 00 8d 58 b0 89 c6 8b 83 34 01 00 00 05 e0 09 00 00 e8 e5 46 14 c8 8b 56 b0 8b 43 04 <89> 42 04 89 10 89 5e b0 8b 83 34 01 00 00 89 5b 04 05 e0 09 00 
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832848] EIP: [<f84832f1>] radeon_ttm_bo_destroy+0x31/0xa0 [radeon] SS:ESP 0068:f42e9d30
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832896] CR2: 0000000000000004
Oct 17 17:05:15 SatelliteL30 kernel: [ 2708.832965] ---[ end trace 4821b7bd8308a654 ]---"
Comment 7 Lollerke 2010-10-17 11:24:50 UTC
In my case the freeze happens before the picture shows up.
Comment 8 Dave Airlie 2010-10-20 17:25:45 UTC
is this still happening with 2.6.36?

I'm going to try and upgrade my rs480 today but its a slow machine ;-)
Comment 9 Dave Airlie 2010-10-20 18:15:56 UTC
okay on 64-bit rs480 here but with 2.6.36-rc8 I'm not able to crash it loading the xkcd large image.

So hopefully someone else can ;-)
Comment 10 Leonid Isaev 2010-10-22 14:14:12 UTC
*** Bug 31038 has been marked as a duplicate of this bug. ***
Comment 11 Matthijs Kooijman 2010-10-26 00:38:07 UTC
I've just compiled 2.6.36, which no longer shows this problem. The big images are rendered properly now, there is no lockup and no oopses etc. in dmesg.
Comment 12 Michel Dänzer 2010-10-26 00:45:22 UTC
Resolving per comment #11, thanks for the update.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.