Created attachment 37467 [details]
Full dmesg with backtrace.

System is Fedora 12 x86_64 with latest updates, KMS and Compiz. The affected kernel is 2.6.35-rc6 compiled from source. The intel driver is xorg-x11-drv-intel-2.9.1-1.fc12.x86_64 from RPM.

While testing for kernel bug 15664 (https://bugzilla.kernel.org/show_bug.cgi?id=15664), I got a hang after 7 days of uptime while scrolling down a Firefox window. After sshing in from another location, I captured the attached backtrace from dmesg. I was using vanilla 2.6.35-rc6 plus a "Repeat unbind during free" patch (https://bugzilla.kernel.org/attachment.cgi?id=27227) that was proposed as a possible fix for the original bug.
Created attachment 37468 [details] Configuration used to compile faulty kernel
Created attachment 37469 [details] lspci -v output for my machine
Use-after-free still seems to be the simplest explanation. Alex, please do add any other hangs you encounter, especially with the repeat-unbind patch [as that is a definite fix for one very rare use-after-free bug].
Compiled 2.6.35 with the repeat-unbind patch. Testing starts now.
Alex, is this one fixed with current bits + the unbind patch?
(In reply to comment #5)
> Alex, is this one fixed with current bits + the unbind patch?

As this bug seems random, I am not sure whether I will hit this again. I have been running 2.6.35 plus repeat-unbind for 3 days without incident.
I'm an incorrigible optimist.
Got the hang and backtrace again. Kernel is 2.6.35 with repeat-unbind patch. Full dmesg output follows.
Created attachment 37793 [details] Kernel dmesg with backtrace.
This crash was triggered by repeatedly scrolling a Firefox window up and down with the mouse wheel. Again.
Hmm, an identical trace. The chances of it being random corruption are reduced. Alex, can you build your kernel with as much debugging enabled as possible (mutex, spinlock, allocation, etc)? kmemcheck would be perfect but unfeasible to use on a desktop system.

The implication is still a use-after-free bug. It might just be related to

commit 86f100b136626e91f4f66f3776303475e2e58998
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Jul 24 21:03:49 2010 +0100

    drm/i915: Unreference object not handle on creation

    When creating an object, we create the handle by which it is known to
    the process and which own the reference to the object. That reference
    to the new handle is what we want to transfer to the process, not the
    lost reference to the object; so free the local object reference *not*
    the process's handle reference. This brings i915_gem_object_create_ioctl()
    into line with drm_gem_open_ioctl()

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Eric Anholt <eric@anholt.net>

but I admit to clutching at straws.
(In reply to comment #11)
> Hmm, an identical trace. The chances of it being random corruption are reduced.
> Alex, can you build your kernel with as much debugging enabled as possible
> (mutex, spinlock, allocation, etc)? kmemcheck would be perfect but unfeasible
> to use on a desktop system.

Done. My diff from the previous configuration (I hope this is enough):

--- linux-config-nodebug.config 2010-08-12 10:17:02.000000000 -0500
+++ linux-config-debug.config 2010-08-12 11:37:45.000000000 -0500
@@ -1,7 +1,7 @@
 #
 # Automatically generated make config: don't edit
 # Linux kernel version: 2.6.35
-# Mon Aug 2 10:08:54 2010
+# Thu Aug 12 10:17:46 2010
 #
 CONFIG_64BIT=y
 # CONFIG_X86_32 is not set
@@ -69,7 +69,7 @@
 CONFIG_LOCK_KERNEL=y
 CONFIG_INIT_ENV_ARG_LIMIT=32
 CONFIG_CROSS_COMPILE=""
-CONFIG_LOCALVERSION=""
+CONFIG_LOCALVERSION="-debug"
 # CONFIG_LOCALVERSION_AUTO is not set
 CONFIG_HAVE_KERNEL_GZIP=y
 CONFIG_HAVE_KERNEL_BZIP2=y
@@ -4003,8 +4003,13 @@
 CONFIG_SCHED_DEBUG=y
 CONFIG_SCHEDSTATS=y
 CONFIG_TIMER_STATS=y
-# CONFIG_DEBUG_OBJECTS is not set
-# CONFIG_SLUB_DEBUG_ON is not set
+CONFIG_DEBUG_OBJECTS=y
+# CONFIG_DEBUG_OBJECTS_SELFTEST is not set
+CONFIG_DEBUG_OBJECTS_FREE=y
+CONFIG_DEBUG_OBJECTS_TIMERS=y
+CONFIG_DEBUG_OBJECTS_WORK=y
+CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
+CONFIG_SLUB_DEBUG_ON=y
 # CONFIG_SLUB_STATS is not set
 # CONFIG_DEBUG_KMEMLEAK is not set
 CONFIG_DEBUG_PREEMPT=y
@@ -4014,22 +4019,24 @@
 CONFIG_DEBUG_SPINLOCK=y
 CONFIG_DEBUG_MUTEXES=y
 CONFIG_DEBUG_LOCK_ALLOC=y
-# CONFIG_PROVE_LOCKING is not set
+CONFIG_PROVE_LOCKING=y
+# CONFIG_PROVE_RCU is not set
 CONFIG_LOCKDEP=y
 # CONFIG_LOCK_STAT is not set
 # CONFIG_DEBUG_LOCKDEP is not set
+CONFIG_TRACE_IRQFLAGS=y
 CONFIG_DEBUG_SPINLOCK_SLEEP=y
 # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
 CONFIG_STACKTRACE=y
-# CONFIG_DEBUG_KOBJECT is not set
+CONFIG_DEBUG_KOBJECT=y
 CONFIG_DEBUG_BUGVERBOSE=y
 CONFIG_DEBUG_INFO=y
-# CONFIG_DEBUG_VM is not set
-# CONFIG_DEBUG_VIRTUAL is not set
+CONFIG_DEBUG_VM=y
+CONFIG_DEBUG_VIRTUAL=y
 # CONFIG_DEBUG_WRITECOUNT is not set
 CONFIG_DEBUG_MEMORY_INIT=y
 CONFIG_DEBUG_LIST=y
-# CONFIG_DEBUG_SG is not set
+CONFIG_DEBUG_SG=y
 CONFIG_DEBUG_NOTIFIERS=y
 # CONFIG_DEBUG_CREDENTIALS is not set
 CONFIG_ARCH_WANT_FRAME_POINTERS=y
@@ -4045,8 +4052,8 @@
 CONFIG_CPU_NOTIFIER_ERROR_INJECT=m
 # CONFIG_FAULT_INJECTION is not set
 CONFIG_LATENCYTOP=y
-# CONFIG_SYSCTL_SYSCALL_CHECK is not set
-# CONFIG_DEBUG_PAGEALLOC is not set
+CONFIG_SYSCTL_SYSCALL_CHECK=y
+CONFIG_DEBUG_PAGEALLOC=y
 CONFIG_USER_STACKTRACE_SUPPORT=y
 CONFIG_HAVE_FUNCTION_TRACER=y
 CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
Created attachment 37819 [details]
Initial dmesg with debug kernel

Currently running the debug kernel. I got the attached dmesg; no backtrace yet. However, I notice these weird messages:

[ 9.234337] kobject: 'holders': free name
[ 9.234340] kobject: 'intel_rng' (ffffffffa000e710): kobject_cleanup
[ 9.234341] kobject: 'intel_rng' (ffffffffa000e710): does not have a release() function, it is broken and must be fixed.
[ 10.839184] kobject: 'holders': free name
[ 10.839186] kobject: 'scsi_wait_scan' (ffffffffa00c7090): kobject_cleanup
[ 10.839188] kobject: 'scsi_wait_scan' (ffffffffa00c7090): does not have a release() function, it is broken and must be fixed.
[ 14.300151] kobject: 'holders': free name
[ 14.300154] kobject: 'microcode' (ffffffffa0128070): kobject_cleanup
[ 14.300156] kobject: 'microcode' (ffffffffa0128070): does not have a release() function, it is broken and must be fixed.

They do not seem to be related to my problem, though.
Created attachment 37928 [details]
dmesg with debug enabled and "Poison overwritten" backtrace

I got a "BUG kmalloc-512: Poison overwritten" report on 2.6.35. The dmesg report is attached. The i915_* functions appear in the backtrace.
Testing since today with 2.6.36-rc3. No additional patches on top of vanilla kernel, but all previous debugging is enabled.
(In reply to comment #14)
> Created an attachment (id=37928) [details]
> dmesg with debug enabled and "Poison overwritten" backtrace
>
> I got a "BUG kmalloc-512: Poison overwritten" report on 2.6.35. The dmesg
> report is attached. The i915_* functions appear on the backtrace.

Unrelated to us. c0 93 11 2d 01 88 ff ff would appear to be an ip address and netmask, so the suspicion lies with the br code - please do forward them that stack trace.

And thanks for running with all the extra debugging.
Still have no clue as to what's happening here. Can you grab an oops from git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel.git drm-intel-next so that (a) we verify that it still occurs and (b) so that I can match the disassembly to the source code?
Could not (yet) trigger the hang in any of my tests with 2.6.36-rc3. Testing now with 2.6.36-rc4.
Created attachment 38765 [details]
Backtrace with 2.6.36-rc4

Arrrhghh!!! I thought the -rc4 was free from the bug at last!
I have stared long and hard at all the traces and I simply do not see how we get an invalid pointer dereference there. Downgrading the priority since the bug seems to be quite rare.
Filed a kernel bug report for the "Poison overwritten" issue at https://bugzilla.kernel.org/show_bug.cgi?id=18842

Now testing for this bug with commit 8b15575cae7a93a784c3005c42b069edd9ba64dd (past 2.6.36-rc5 in the mainline kernel), with debugging enabled as before.
Created attachment 40238 [details] dmesg output with backtrace at the end for recompiled 2.6.36
Created attachment 40239 [details] config file for recompiled 2.6.36 kernel
Created attachment 40240 [details] Xorg output for recompiled 2.6.36
For a test, I took the kernel config file for the Fedora 14 x86_64 LiveCD, and ran "make oldconfig" for the vanilla 2.6.36 kernel. Unfortunately this recompiled kernel crashed too, but the messages shown are slightly different. I attached the dmesg with the backtrace, as well as the Xorg.0.log file, and the kernel configuration file after "make oldconfig". I see there are quite a few lines in the Xorg log that read "i830_uxa_prepare_access: bo map failed". In addition, I see about the same number of lines in the dmesg output that say "[drm:i915_gem_create_mmap_offset] *ERROR* failed to add to map hash", around 45 seconds before the crash. Does this add anything new?
Those errors would be consistent with the earlier crash; both concern some memory corruption of the map hash table. I admit to still being puzzled.
Created attachment 41854 [details]
dmesg output with backtrace for 2.6.37

Still present in vanilla 2.6.37. Is there a way to add some integrity check to drm_ht_insert_item and friends, so that a hash map corruption can be detected as soon as possible? That way I could correlate the warning with whatever activity is happening at the moment, to narrow down the causes. Better yet would be a way to warn about, or prevent, accesses to the memory occupied by the hash table by code other than the hash functions.

The bug has now seeped into Fedora 14 stable - see https://bugzilla.redhat.com/show_bug.cgi?id=665887 .
Quick locking audit for the hash table.

The offset_hash is touched in drm_gem_init and drm_gem_destroy, both of which only happen at load/unload, so that's not the issue. drm_gem_mmap, i915_gem_create_mmap_offset, and i915_gem_free_mmap_offset also touch it; the first does lookups and the latter two do insert and remove.

drm_gem_mmap takes the struct_mutex around the lookup. i915_gem_create_mmap_offset also takes the lock through its caller, i915_gem_mmap_gtt_ioctl, so it's also safe. i915_gem_free_mmap_offset is called by i915_gem_free_object_tail, which does i915_gem_object_unbind, so it had better be locked. It's called by both i915_gem_retire_requests and i915_gem_free_object. AFAICT those paths are protected by the struct_mutex as well, so it doesn't appear to be a race.

All the

[267379.215094] [drm:i915_gem_create_mmap_offset] *ERROR* failed to add to map hash

messages are concerning; it means the hash key didn't match, which certainly implies that the h_list got corrupted. But there were quite a few free code paths, so maybe one really is dropping the lock? You could try something like this:

--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1331,6 +1331,8 @@ i915_gem_free_mmap_offset(struct drm_i915_gem_object *obj)
 	struct drm_gem_mm *mm = dev->mm_private;
 	struct drm_map_list *list = &obj->base.map_list;
 
+	BUG_ON(!mutex_is_locked(&dev->struct_mutex));
+
 	drm_ht_remove_item(&mm->offset_hash, &list->hash);
 	drm_mm_put_block(list->file_offset_node);
 	kfree(list->map);

to confirm that. It's also possible that hashed_key is bogus (see drm_ht_insert_item), so you could add a bounds check to make sure it's within the size of ht->table[].
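A minimal sketch of what that bounds check might look like, assuming the 2.6.3x-era struct drm_open_hash where the table holds (1 << order) buckets and hashed_key comes from hash_long(); the helper name and placement are illustrative, not part of any actual patch:

/* Hypothetical helper: call before indexing ht->table[hashed_key] in
 * drm_ht_insert_item() and friends. Sketch only. */
static inline void drm_ht_check_key(struct drm_open_hash *ht,
                                    unsigned int hashed_key)
{
        /* hash_long(key, ht->order) must stay below the bucket count */
        BUG_ON(hashed_key >= (1U << ht->order));
}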
Created attachment 42203 [details] [review]
Patch to add two BUG_ON on 2.6.38-rc1

This patch contains the changes I made to vanilla 2.6.38-rc1. In addition to the patch suggested in the last comment, I also added a BUG_ON in drm_ht_insert_item with a check against ht->size. I will run 2.6.38-rc1 for as long as possible (until I get a crash, or -rc2 is available).
Created attachment 42292 [details]
dmesg output with messages for 2.6.38-rc1

The message "[drm:i915_gem_create_mmap_offset] *ERROR* failed to add to map hash" appeared again, but now the failure mode is different. With 2.6.38-rc1 my entire graphical session crashed when I tried to scroll a listbox in Eclipse, but the machine then bounced back to the graphical login without hanging. The attached file contains the kernel message log. Note that neither WARN_ON triggered at any point.

There were two separate episodes of message flood: the first from 84909.359537 to 84909.364938, and the second from 89647.299393 to 89649.534727.
Created attachment 42293 [details] Xorg.0.log right after the crash on 2.6.38-rc1 .
Created attachment 42485 [details]
dmesg output with messages for 2.6.38-rc2

Bug still present in 2.6.38-rc2, though the crash has changed slightly. This time the message starts with:

[26443.152330] BUG: unable to handle kernel paging request at 0000100000000000
[26443.152393] IP: [<ffffffffa002dc80>] drm_ht_insert_item+0x62/0xb7 [drm]

As before, neither WARN_ON triggered before this crash.

From what little I can understand from reading the code, the message "...failed to add to map hash..." should only trigger when attempting to insert a duplicate key into the hash. I also gather that the "key" is actually a memory address returned as part of a structure from drm_mm_get_block_generic(). So either the memory allocator is returning the same address to multiple allocation requests, or the hash table is not being cleared of no-longer-used memory addresses.

I am thinking about adding trace messages to dump the contents of the hashmap, and trace messages to drm_ht_insert_item, to see where in the implementation the crash occurs. BTW, 0x0000100000000000 looks like an invalid address, and the analysis above does not explain how the code comes to dereference such an address.
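For illustration, the duplicate-key rejection described above boils down to the following simplified, self-contained sketch; this is not the literal drm_hashtab.c code (which walks a kernel hlist and keeps the bucket sorted), just the logic being reasoned about:

#include <errno.h>
#include <stddef.h>

struct item {
        unsigned long key;
        struct item *next;
};

/* Refuse to insert an item whose key is already present in the bucket;
 * the caller then logs "failed to add to map hash". */
static int bucket_insert(struct item **bucket, struct item *it)
{
        struct item *cur;

        for (cur = *bucket; cur != NULL; cur = cur->next) {
                if (cur->key == it->key)
                        return -EINVAL;
        }
        it->next = *bucket;
        *bucket = it;
        return 0;
}

Either of the two possibilities above (reused address, or stale entry never removed) would make this walk find a matching key and fail the insert.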
Did you ever check your machine with memtest86? But if it is bad memory then it is being incredibly specific all this time... I still don't understand how you appear to be the only person who suffers from this bug considering the apparent severity.
(In reply to comment #33)
> Did you ever check your machine with memtest86? But if it is bad memory then it
> is being incredibly specific all this time... I still don't understand how you
> appear to be the only person who suffers from this bug considering the apparent
> severity.

Yes, I have checked with memtest. No hints of bad memory.

No, I am not the only person who has seen this bug. See
https://bugzilla.redhat.com/show_bug.cgi?id=665887 comment 3.
(In reply to comment #34)
> (In reply to comment #33)
> > Did you ever check your machine with memtest86? But if it is bad memory then it
> > is being incredibly specific all this time... I still don't understand how you
> > appear to be the only person who suffers from this bug considering the apparent
> > severity.
>
> Yes, I have checked with memtest. No hints of bad memory.
>
> No, I am not the only person who has seen this bug. See
> https://bugzilla.redhat.com/show_bug.cgi?id=665887 comment 3.

Also, these reports look like the same bug:

https://lists.linux-foundation.org/pipermail/bugme-new/2010-December/026319.html
http://kerneltrap.org/mailarchive/linux-kernel/2010/7/8/4591541
http://ubuntuforums.org/showthread.php?t=1385321
Created attachment 42526 [details]
dmesg output with messages for 2.6.38-rc2 - new crash

I got another crash on 2.6.38-rc2 (with mutter/gnome-shell, apparently while hovering the mouse pointer over Thunderbird to display a tooltip). However, this one is different from the previous ones.
Created attachment 42536 [details] [review]
Patch to add additional tracing to DRM hash table functions

To assist in locating the bug, I created this patch. It is supposed to dump the state of the hash bucket for the selected key in drm_ht_insert_item, drm_ht_remove*, and so on. After recompiling, I booted with drm.debug=1 and will work normally until a hang occurs.

BTW, what is the purpose of the 'fill' struct field of 'struct drm_open_hash'? The remove functions decrement this variable, but I see no corresponding increment, and no other references to this variable in the drm code.
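Roughly, the bucket dump in the attached patch has the shape below; this is only a sketch of the approach (the real code is in the attachment), assuming the 2.6.3x drm_hashtab structures and a capped walk so a looping list triggers a WARN_ON instead of spinning forever:

/* Sketch only - dump the bucket for 'key', capping the walk at 16 nodes
 * and warning if the cap is hit (a sign of a corrupted/looping list). */
static void drm_ht_verbose_list(struct drm_open_hash *ht, unsigned long key)
{
        struct drm_hash_item *entry;
        struct hlist_node *node;
        unsigned int hashed_key = hash_long(key, ht->order);
        int count = 0;

        DRM_INFO("Key is 0x%08lx, Hashed key is 0x%08x\n", key, hashed_key);
        hlist_for_each_entry(entry, node, &ht->table[hashed_key], head) {
                DRM_INFO("count %d, key: 0x%08lx list: %p list->next: %p\n",
                         count, entry->key, node, node->next);
                if (WARN_ON(++count >= 16))
                        break;
        }
}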
Created attachment 42537 [details]
dmesg output with trace patch - "normal" case

Baseline situation with machine working "normally", and trace patch active.
One thing to try is whether

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index ea1c4b0..cd54e67 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -101,7 +101,7 @@ drm_gem_init(struct drm_device *dev)
 
 	dev->mm_private = mm;
 
-	if (drm_ht_create(&mm->offset_hash, 19)) {
+	if (drm_ht_create(&mm->offset_hash, 9)) {
 		kfree(mm);
 		return -ENOMEM;
 	}

has any effect on the frequency of this bug.
Created attachment 42864 [details] [review]
A different version of the trace patch

This version of the patch makes drm_ht_verbose_list() and the traces use DRM_INFO instead of DRM_DEBUG for the messages. This is required because there is quite a bit of output, and drm.debug=1 causes mangled output in /var/log/messages, apparently because it is generated faster than syslog can handle.
I have news on this bug. I am now testing on 2.6.38-rc3, and I have witnessed an interesting hash map corruption by means of the latest trace patch.

After running for one day and leaving Vuze running overnight, I resumed work today and was alerted to a WARN_ON from my patch. At time 73152.321693, drm tried to insert key 0x001306ec, which hashed to 0x00000057. At that time there were 4 items in the hash bucket:

Feb 2 11:57:18 srv64 kernel: [73152.321693] [drm] drm_ht_insert_item: before insert, fill=4294684641
Feb 2 11:57:18 srv64 kernel: [73152.321697] [drm] Key is 0x001306ec, Hashed key is 0x00000057
Feb 2 11:57:18 srv64 kernel: [73152.321702] [drm] count 0, key: 0x001260d1 list: 0xffff880037a10828 list->next: 0xffff880037af9428
Feb 2 11:57:18 srv64 kernel: [73152.321706] [drm] count 1, key: 0x0012ae80 list: 0xffff880037af9428 list->next: 0xffff88000434ce28
Feb 2 11:57:18 srv64 kernel: [73152.321710] [drm] count 2, key: 0x0012bcbf list: 0xffff88000434ce28 list->next: 0xffff88011bbbb028
Feb 2 11:57:18 srv64 kernel: [73152.321714] [drm] count 3, key: 0x00130e80 list: 0xffff88011bbbb028 list->next: 0x (null)
Feb 2 11:57:18 srv64 kernel: [73152.321717] [drm] drm_ht_insert_item: after insert, fill=4294684641
Feb 2 11:57:18 srv64 kernel: [73152.321720] [drm] Key is 0x001306ec, Hashed key is 0x00000057
Feb 2 11:57:18 srv64 kernel: [73152.321724] [drm] count 0, key: 0x001260d1 list: 0xffff880037a10828 list->next: 0xffff880037af9428
Feb 2 11:57:18 srv64 kernel: [73152.321728] [drm] count 1, key: 0x0012ae80 list: 0xffff880037af9428 list->next: 0xffff88000434ce28
Feb 2 11:57:18 srv64 kernel: [73152.321732] [drm] count 2, key: 0x0012bcbf list: 0xffff88000434ce28 list->next: 0xffff8800378e6a28
Feb 2 11:57:18 srv64 kernel: [73152.321737] [drm] count 3, key: 0x001306ec list: 0xffff8800378e6a28 list->next: 0xffff88011bbbb028
Feb 2 11:57:18 srv64 kernel: [73152.321741] [drm] count 4, key: 0x00130e80 list: 0xffff88011bbbb028 list->next: 0x (null)

Later, at 73206.237985, drm tried to remove key 0x001306ec.
However, the linked list was now corrupted:

Feb 2 11:58:12 srv64 kernel: [73206.237985] [drm] drm_ht_remove_item: before remove, fill=4294683133
Feb 2 11:58:12 srv64 kernel: [73206.237987] [drm] Key is 0x001306ec, Hashed key is 0x00000057
Feb 2 11:58:12 srv64 kernel: [73206.237990] [drm] count 0, key: 0x001260d1 list: 0xffff880037a10828 list->next: 0xffff880037af9428
Feb 2 11:58:12 srv64 kernel: [73206.237992] [drm] count 1, key: 0x0012ae80 list: 0xffff880037af9428 list->next: 0xffff88000c7abc20
Feb 2 11:58:12 srv64 kernel: [73206.237994] [drm] count 2, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb 2 11:58:12 srv64 kernel: [73206.237997] [drm] count 3, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb 2 11:58:12 srv64 kernel: [73206.237999] [drm] count 4, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb 2 11:58:12 srv64 kernel: [73206.238005] [drm] count 5, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb 2 11:58:12 srv64 kernel: [73206.238008] [drm] count 6, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb 2 11:58:12 srv64 kernel: [73206.238010] [drm] count 7, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb 2 11:58:12 srv64 kernel: [73206.238013] [drm] count 8, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb 2 11:58:12 srv64 kernel: [73206.238015] [drm] count 9, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb 2 11:58:12 srv64 kernel: [73206.238018] [drm] count 10, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb 2 11:58:12 srv64 kernel: [73206.238021] [drm] count 11, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb 2 11:58:12 srv64 kernel: [73206.238023] [drm] count 12, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb 2 11:58:12 srv64 kernel: [73206.238026] [drm] count 13, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb 2 11:58:12 srv64 kernel: [73206.238028] [drm] count 14, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb 2 11:58:12 srv64 kernel: [73206.238030] [drm] count 15, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb 2 11:58:12 srv64 kernel: [73206.238032] ------------[ cut here ]------------
Feb 2 11:58:12 srv64 kernel: [73206.238061] WARNING: at drivers/gpu/drm/drm_hashtab.c:81 drm_ht_verbose_list+0x59/0x96 [drm]()
Feb 2 11:58:12 srv64 kernel: [73206.238063] Hardware name:
Feb 2 11:58:12 srv64 kernel: [73206.238065] Modules linked in: tcp_lp bluetooth rfkill fuse ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat bridge stp llc deflate zlib_deflate ctr camellia cast5 rmd160 crypto_null ccm serpent blowfish twofish_x86_64 twofish_common ecb xcbc cbc sha256_generic sha512_generic des_generic cryptd aes_x86_64 aes_generic ah6 ah4 esp6 esp4 xfrm4_mode_beet xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_ro xfrm6_mode_beet xfrm6_mode_tunnel ipcomp ipcomp6 xfrm_ipcomp xfrm6_tunnel tunnel6 af_key vboxnetadp vboxnetflt vboxdrv sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf capi capifs kernelcapi ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm uinput snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer ppdev parport_pc snd iTCO_wdt iTCO_vendor_support soundcore snd_page_alloc r8169 pcspkr parport i2c_i801 mii microcode i915 drm_kms_helper drm i
Feb 2 11:58:12 srv64 kernel: 2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
Feb 2 11:58:12 srv64 kernel: [73206.238138] Pid: 1975, comm: Xorg Not tainted 2.6.38-rc3 #5
Feb 2 11:58:12 srv64 kernel: [73206.238139] Call Trace:
Feb 2 11:58:12 srv64 kernel: [73206.238146] [<ffffffff810515ef>] ? warn_slowpath_common+0x85/0x9d
Feb 2 11:58:12 srv64 kernel: [73206.238150] [<ffffffff81051621>] ? warn_slowpath_null+0x1a/0x1c
Feb 2 11:58:12 srv64 kernel: [73206.238159] [<ffffffffa002dcf4>] ? drm_ht_verbose_list+0x59/0x96 [drm]
Feb 2 11:58:12 srv64 kernel: [73206.238164] [<ffffffff811212cd>] ? lookup_page_cgroup+0x36/0x4c
Feb 2 11:58:12 srv64 kernel: [73206.238174] [<ffffffffa002dd69>] ? drm_ht_remove_item+0x38/0x8e [drm]
Feb 2 11:58:12 srv64 kernel: [73206.238177] [<ffffffff811212cd>] ? lookup_page_cgroup+0x36/0x4c
Feb 2 11:58:12 srv64 kernel: [73206.238191] [<ffffffffa007a61b>] ? i915_gem_free_object_tail+0x78/0xd3 [i915]
Feb 2 11:58:12 srv64 kernel: [73206.238199] [<ffffffffa002707e>] ? drm_gem_object_free+0x0/0x2d [drm]
Feb 2 11:58:12 srv64 kernel: [73206.238210] [<ffffffffa007c7f3>] ? i915_gem_free_object+0x80/0x88 [i915]
Feb 2 11:58:12 srv64 kernel: [73206.238218] [<ffffffffa002707e>] ? drm_gem_object_free+0x0/0x2d [drm]
Feb 2 11:58:12 srv64 kernel: [73206.238226] [<ffffffffa00270a9>] ? drm_gem_object_free+0x2b/0x2d [drm]
Feb 2 11:58:12 srv64 kernel: [73206.238230] [<ffffffff81231e5e>] ? kref_put+0x43/0x4d
Feb 2 11:58:12 srv64 kernel: [73206.238238] [<ffffffffa0026fad>] ? drm_gem_object_unreference_unlocked+0x33/0x40 [drm]
Feb 2 11:58:12 srv64 kernel: [73206.238246] [<ffffffffa002705a>] ? drm_gem_object_handle_unreference_unlocked+0x35/0x39 [drm]
Feb 2 11:58:12 srv64 kernel: [73206.238255] [<ffffffffa00273ef>] ? drm_gem_close_ioctl+0x79/0x84 [drm]
Feb 2 11:58:12 srv64 kernel: [73206.238263] [<ffffffffa00258b8>] ? drm_ioctl+0x2a8/0x385 [drm]
Feb 2 11:58:12 srv64 kernel: [73206.238267] [<ffffffff811f9763>] ? inode_has_perm.clone.19+0x76/0x8c
Feb 2 11:58:12 srv64 kernel: [73206.238275] [<ffffffffa0027376>] ? drm_gem_close_ioctl+0x0/0x84 [drm]
Feb 2 11:58:12 srv64 kernel: [73206.238279] [<ffffffff81048a9a>] ? get_parent_ip+0x11/0x41
Feb 2 11:58:12 srv64 kernel: [73206.238282] [<ffffffff81048a9a>] ? get_parent_ip+0x11/0x41
Feb 2 11:58:12 srv64 kernel: [73206.238286] [<ffffffff81481005>] ? sub_preempt_count+0x97/0xaa
Feb 2 11:58:12 srv64 kernel: [73206.238290] [<ffffffff81113665>] ? __bit_spin_unlock.clone.42+0x2c/0x37
Feb 2 11:58:12 srv64 kernel: [73206.238293] [<ffffffff811f981e>] ? file_has_perm+0xa5/0xc7
Feb 2 11:58:12 srv64 kernel: [73206.238296] [<ffffffff811333a7>] ? do_vfs_ioctl+0x4a4/0x4e5
Feb 2 11:58:12 srv64 kernel: [73206.238300] [<ffffffff8113343e>] ? sys_ioctl+0x56/0x7c
Feb 2 11:58:12 srv64 kernel: [73206.238303] [<ffffffff8100ac82>] ? system_call_fastpath+0x16/0x1b
Feb 2 11:58:12 srv64 kernel: [73206.238306] ---[ end trace d10b7b5570a59e52 ]---

Notice the weird pattern:

[drm] count 0, key: 0x001260d1 list: 0xffff880037a10828 list->next: 0xffff880037af9428
[drm] count 1, key: 0x0012ae80 list: 0xffff880037af9428 list->next: 0xffff88000c7abc20
[drm] count 2, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
[drm] count 3, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
[drm] count 4, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20

The node at count 1 got its "next" pointer corrupted with the value 0xffff88000c7abc20, which then points to memory that, when interpreted as nodes, decodes to two nodes that happen to point to each other in a loop. The trace patch detects this and prints a WARN_ON.

I have the full log available (74 MB after gzip), but I think the bugtracker will not accept it. The weirdest thing is that the machine is still running as I type this - no hangs (yet).

What do you think about this?
Full messages being uploaded at: http://srv25.palosanto.com/palosanto.com/~a_villacis/kernel-messages-2.6.38-rc3.txt.gz Complete file size is 76559430 bytes.
It means something else scribbled over a live object in that 50s interval whilst your back was turned. Use-after-free somewhere else in the kernel.
Created attachment 42872 [details] [review]
Slab allocate objects.

So if I am right in that some other layer is scribbling over old memory, then allocating the objects from a separate slab should prevent that...
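In outline, the "separate slab" idea looks like the sketch below; the real change is in the attachment, and the cache name and helper names here are illustrative assumptions:

/* Give struct drm_i915_gem_object its own kmem_cache so nothing else in the
 * kernel can share - and scribble over - the same kmalloc-512 slabs. */
static struct kmem_cache *i915_gem_object_cache;

static int i915_gem_object_cache_init(void)
{
        i915_gem_object_cache = kmem_cache_create("i915_gem_object",
                                        sizeof(struct drm_i915_gem_object),
                                        0, SLAB_HWCACHE_ALIGN, NULL);
        return i915_gem_object_cache ? 0 : -ENOMEM;
}

/* Allocation/free sites then switch from kzalloc()/kfree() to the cache. */
static struct drm_i915_gem_object *i915_gem_object_alloc(void)
{
        return kmem_cache_zalloc(i915_gem_object_cache, GFP_KERNEL);
}

static void i915_gem_object_free(struct drm_i915_gem_object *obj)
{
        kmem_cache_free(i915_gem_object_cache, obj);
}

With slub_debug enabled, a dedicated cache should also make it easier to tell from the poison reports whether the overwritten object really is a GEM object or belongs to some other user of the previously shared cache.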
(In reply to comment #44)
> Created an attachment (id=42872) [details]
> Slab allocate objects.
>
> So if I am right in that some other layer is scribbling over old memory, then
> allocating the objects from a separate slab should prevent that...

By "some other layer", do you mean somewhere higher up in the intel driver? The corruption happened while working in the graphics session in gnome-shell. Besides that clue, is there a hint in the corrupted values that points specifically at the files you touch with the patch?
Actually, I meant some other bit of kernel code that shares the same kmalloc cache as struct drm_i915_gem_object. That debug log does not reveal where the corruption came from, but does tell us that it has nothing to do with the mmap_offset ht.
(In reply to comment #46)
> Actually, I meant some other bit of kernel code that shares the same kmalloc
> cache as struct drm_i915_gem_object.
>
> That debug log does not reveal where the corruption came from, but does tell us
> that it has nothing to do with the mmap_offset ht.

Then, is there a way to query which portions of the kernel share(d) allocations with struct drm_i915_gem_object? This would narrow down the suspects.
Turn on slab debugging. That should give you some directories under /sys/kernel/debug/ that describe the objects allocated from each of the caches (including the callers).
(In reply to comment #48)
> Turn on slab debugging. That should give you some directories under
> /sys/kernel/debug/ that describe the objects allocated from each of the caches
> (including the callers).

I rebooted with "slub_debug=FZP", but I still get nothing under /sys/kernel/debug/. Was this the right thing to do?
Created attachment 42956 [details]
Crash after slab-allocate patch

After running a day-long session with gnome-shell, I logged out of the session, then logged in again. After the desktop finished loading, I tried moving the mouse, only to find that the session had frozen. After a minute or so, the PC beeped but the display remained frozen. I sshed into the machine and got the backtrace attached here. This time it is a different place and call stack...

Again I ask: was the slub_debug kernel parameter enough, or should I do something else, like recompiling the kernel with some debug option?
Created attachment 43166 [details]
slabinfo -v report for vanilla 2.6.38-rc4 kernel

Currently running 2.6.38-rc4 with slub_debug=FZP. This morning, after leaving the computer on all night, I ran slabinfo -v, and the attached report appeared in /var/log/messages. I ran slabinfo -v a few more times and collected the various reports in the attached file.
Created attachment 43340 [details]
Another memory corruption report with i915 in separate slab

I recompiled the kernel with the patch to allocate objects in a separate slab, and rebooted with slub_debug=FZPU. After a weekend of running Vuze (the bittorrent client), I found the attached memory corruptions. The very first one in the file was found without my running any commands; the rest were found with slabinfo -v.

Is this enough evidence to conclude that some network code is corrupting memory? Or could it just be that network data is what ended up written over i915's memory?
What's your ipaddr/netmask?
Only the first instance looks like an ipaddr/netmask. The others look like poison values. Still consistent with a use-after-free, with a netdev suspect.
(In reply to comment #53)
> What's your ipaddr/netmask?

[alex@srv64 linux-2.6.38-rc4]$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:27:0E:02:59:2B
          inet addr:192.168.3.2  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::227:eff:fe02:592b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:19917 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5932 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:5644305 (5.3 MiB)  TX bytes:1018994 (995.1 KiB)
          Interrupt:44 Base address:0xe000

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:215 errors:0 dropped:0 overruns:0 frame:0
          TX packets:215 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:10484 (10.2 KiB)  TX bytes:10484 (10.2 KiB)

virbr0    Link encap:Ethernet  HWaddr E6:66:F3:14:5A:EA
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:176 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 b)  TX bytes:11842 (11.5 KiB)
Hi Alex, are you connected to a VPN when the oops happens? I've seen this twice recently on 2.6.35.11-83.fc14.x86_64.
(In reply to comment #56)
> Hi Alex, are you connected to a VPN when the oops happens? I've seen this twice
> recently on 2.6.35.11-83.fc14.x86_64.

No. However, I have frequently seen the error shortly after returning to the graphical session, having left Azureus (the bittorrent client) running all night.
I have upgraded from 2.6.38-rc4 to 2.6.38-rc8, and then to 2.6.38 stable, and I have not seen the graphics hang since. I ran a whole week with -rc8 and two days (and counting) with stable. I should add that I removed the slab-allocate patch with no change in stability (on 2.6.38 stable). However, I still have slub_debug=FZPU on my kernel command line, and periodic checking with slabinfo -v shows no corruption.
If you were a masochist, you could bisect and see where the bug lay. Hope springs eternal...