Bug 29325 - [G45] Oops [*0xf] at drm_ht_insert_item+0x88
Summary: [G45] Oops [*0xf] at drm_ht_insert_item+0x88
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
: medium critical
Assignee: Jesse Barnes
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2010-07-30 13:50 UTC by Alex Villacís Lasso
Modified: 2017-07-24 23:07 UTC (History)
4 users (show)

See Also:
i915 platform:
i915 features:


Attachments
Full dmesg with backtrace. (67.32 KB, text/plain)
2010-07-30 13:50 UTC, Alex Villacís Lasso
no flags Details
Configuration used to compile faulty kernel (96.98 KB, text/plain)
2010-07-30 13:58 UTC, Alex Villacís Lasso
no flags Details
lspci -v output for my machine (6.00 KB, text/plain)
2010-07-30 14:00 UTC, Alex Villacís Lasso
no flags Details
Kernel dmesg with backtrace. (122.51 KB, text/plain)
2010-08-11 11:09 UTC, Alex Villacís Lasso
no flags Details
Initial dmesg with debug kernel (124.01 KB, text/plain)
2010-08-12 09:47 UTC, Alex Villacís Lasso
no flags Details
dmesg with debug enabled and "Poison overwritten" backtrace (123.81 KB, text/plain)
2010-08-17 14:48 UTC, Alex Villacís Lasso
no flags Details
Backtrace with 2.6.36-rc4 (122.31 KB, text/plain)
2010-09-17 08:21 UTC, Alex Villacís Lasso
no flags Details
dmesg output with backtrace at the end for recompiled 2.6.36 (74.98 KB, text/plain)
2010-11-12 08:13 UTC, Alex Villacís Lasso
no flags Details
config file for recompiled 2.6.36 kernel (107.30 KB, application/octet-stream)
2010-11-12 08:17 UTC, Alex Villacís Lasso
no flags Details
Xorg output for recompiled 2.6.36 (184.21 KB, text/x-log)
2010-11-12 08:18 UTC, Alex Villacís Lasso
no flags Details
dmesg output with backtrace for 2.6.37 (123.61 KB, text/plain)
2011-01-10 10:13 UTC, Alex Villacís Lasso
no flags Details
Patch to add two BUG_ON on 2.6.38-rc1 (1.11 KB, patch)
2011-01-19 09:39 UTC, Alex Villacís Lasso
no flags Details | Splinter Review
dmesg output with messages for 2.6.38-rc1 (123.16 KB, text/plain)
2011-01-21 14:52 UTC, Alex Villacís Lasso
no flags Details
Xorg.0.log right after the crash on 2.6.38-rc1 . (166.35 KB, text/plain)
2011-01-21 14:53 UTC, Alex Villacís Lasso
no flags Details
dmesg output with messages for 2.6.38-rc2 (108.74 KB, text/plain)
2011-01-25 12:40 UTC, Alex Villacís Lasso
no flags Details
dmesg output with messages for 2.6.38-rc2 - new crash (84.09 KB, text/plain)
2011-01-26 07:58 UTC, Alex Villacís Lasso
no flags Details
Patch to add additional tracing to DRM hash table functions (1.64 KB, patch)
2011-01-26 09:18 UTC, Alex Villacís Lasso
no flags Details | Splinter Review
dmesg output with trace patch - "normal" case (98.34 KB, text/plain)
2011-01-26 09:20 UTC, Alex Villacís Lasso
no flags Details
A different version of the trace patch (2.68 KB, patch)
2011-02-02 11:18 UTC, Alex Villacís Lasso
no flags Details | Splinter Review
Slab allocate objects. (2.42 KB, patch)
2011-02-02 14:04 UTC, Chris Wilson
no flags Details | Splinter Review
Crash after slab-allocate patch (986.39 KB, text/plain)
2011-02-04 16:16 UTC, Alex Villacís Lasso
no flags Details
slabinfo -v report for vanilla 2.6.38-rc4 kernel (59.97 KB, text/plain)
2011-02-09 08:01 UTC, Alex Villacís Lasso
no flags Details
Another memory corruption report with i915 in separate slab (165.01 KB, text/plain)
2011-02-14 08:29 UTC, Alex Villacís Lasso
no flags Details

Description Alex Villacís Lasso 2010-07-30 13:50:59 UTC
Created attachment 37467 [details]
Full dmesg with backtrace.

System is Fedora 12 x86_64 with latest updates, KMS and Compiz. Affected kernel is 2.6.35-rc6 compiled from source. The intel driver is xorg-x11-drv-intel-2.9.1-1.fc12.x86_64 from RPM.

While testing for kernel bug 15664 (https://bugzilla.kernel.org/show_bug.cgi?id=15664), I got a hang after 7 days of uptime while scrolling down a Firefox window. When sshing from another location, I got the attached backtrace from dmesg. I was using vanilla 2.6.35-rc6 plus a "Repeat unbind during free" patch (https://bugzilla.kernel.org/attachment.cgi?id=27227) that was proposed as a possible fix for the original bug.
Comment 1 Alex Villacís Lasso 2010-07-30 13:58:48 UTC
Created attachment 37468 [details]
Configuration used to compile faulty kernel
Comment 2 Alex Villacís Lasso 2010-07-30 14:00:51 UTC
Created attachment 37469 [details]
lspci -v output for my machine
Comment 3 Chris Wilson 2010-07-30 14:44:39 UTC
Use-after-free still seems to be the simplest explanation. Alex please do add any other hangs you encounter, especially with the repeat-unbind patch [as that is a definite fix for one very rare use-after-free bug].
Comment 4 Alex Villacís Lasso 2010-08-02 09:07:33 UTC
Compiled 2.5.35 with repeat-unbind patch. Testing starting now.
Comment 5 Jesse Barnes 2010-08-10 14:13:04 UTC
Alex, is this one fixed with current bits + the unbind patch?
Comment 6 Alex Villacís Lasso 2010-08-10 15:20:32 UTC
(In reply to comment #5)
> Alex, is this one fixed with current bits + the unbind patch?

As this bug seems random, I am not sure whether I will hit this again. I have been running 2.6.35 plus repeat-unbind for 3 days without incident.
Comment 7 Jesse Barnes 2010-08-10 15:31:42 UTC
I'm an incorrigible optimist.
Comment 8 Alex Villacís Lasso 2010-08-11 11:08:10 UTC
Got the hang and backtrace again. Kernel is 2.6.35 with repeat-unbind patch. Full dmesg output follows.
Comment 9 Alex Villacís Lasso 2010-08-11 11:09:03 UTC
Created attachment 37793 [details]
Kernel dmesg with backtrace.
Comment 10 Alex Villacís Lasso 2010-08-11 11:09:59 UTC
This crash was triggered by repeatedly scrolling a Firefox window up and down with the mouse wheel. Again.
Comment 11 Chris Wilson 2010-08-11 14:28:25 UTC
Hmm, an identical trace. The chances of it being random corruption are reduced. Alex, can you build your kernel with as much debugging enabled as possible (mutex, spinlock, allocation, etc)? kmemcheck would be perfect but unfeasible to use on a desktop system.

The implication is still a use-after-free bug. It might just be related to

commit 86f100b136626e91f4f66f3776303475e2e58998
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Jul 24 21:03:49 2010 +0100

    drm/i915: Unreference object not handle on creation
    
    When creating an object, we create the handle by which it is known to
    the process and which own the reference to the object. That reference to
    the new handle is what we want to transfer to the process, not the lost
    reference to the object; so free the local object reference *not* the
    process's handle reference.
    
    This brings i915_gem_object_create_ioctl() into line with
    drm_gem_open_ioctl()
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Eric Anholt <eric@anholt.net>

but I admit to be clutching at straws.
Comment 12 Alex Villacís Lasso 2010-08-12 09:41:58 UTC
(In reply to comment #11)
> Hmm, an identical trace. The chances of it being random corruption are reduced.
> Alex, can you build your kernel with as much debugging enabled as possible
> (mutex, spinlock, allocation, etc)? kmemcheck would be perfect but unfeasible
> to use on a desktop system.
Done. My diff from the previous configuration (I hope this is enough):

--- linux-config-nodebug.config	2010-08-12 10:17:02.000000000 -0500
+++ linux-config-debug.config	2010-08-12 11:37:45.000000000 -0500
@@ -1,7 +1,7 @@
 #
 # Automatically generated make config: don't edit
 # Linux kernel version: 2.6.35
-# Mon Aug  2 10:08:54 2010
+# Thu Aug 12 10:17:46 2010
 #
 CONFIG_64BIT=y
 # CONFIG_X86_32 is not set
@@ -69,7 +69,7 @@
 CONFIG_LOCK_KERNEL=y
 CONFIG_INIT_ENV_ARG_LIMIT=32
 CONFIG_CROSS_COMPILE=""
-CONFIG_LOCALVERSION=""
+CONFIG_LOCALVERSION="-debug"
 # CONFIG_LOCALVERSION_AUTO is not set
 CONFIG_HAVE_KERNEL_GZIP=y
 CONFIG_HAVE_KERNEL_BZIP2=y
@@ -4003,8 +4003,13 @@
 CONFIG_SCHED_DEBUG=y
 CONFIG_SCHEDSTATS=y
 CONFIG_TIMER_STATS=y
-# CONFIG_DEBUG_OBJECTS is not set
-# CONFIG_SLUB_DEBUG_ON is not set
+CONFIG_DEBUG_OBJECTS=y
+# CONFIG_DEBUG_OBJECTS_SELFTEST is not set
+CONFIG_DEBUG_OBJECTS_FREE=y
+CONFIG_DEBUG_OBJECTS_TIMERS=y
+CONFIG_DEBUG_OBJECTS_WORK=y
+CONFIG_DEBUG_OBJECTS_ENABLE_DEFAULT=1
+CONFIG_SLUB_DEBUG_ON=y
 # CONFIG_SLUB_STATS is not set
 # CONFIG_DEBUG_KMEMLEAK is not set
 CONFIG_DEBUG_PREEMPT=y
@@ -4014,22 +4019,24 @@
 CONFIG_DEBUG_SPINLOCK=y
 CONFIG_DEBUG_MUTEXES=y
 CONFIG_DEBUG_LOCK_ALLOC=y
-# CONFIG_PROVE_LOCKING is not set
+CONFIG_PROVE_LOCKING=y
+# CONFIG_PROVE_RCU is not set
 CONFIG_LOCKDEP=y
 # CONFIG_LOCK_STAT is not set
 # CONFIG_DEBUG_LOCKDEP is not set
+CONFIG_TRACE_IRQFLAGS=y
 CONFIG_DEBUG_SPINLOCK_SLEEP=y
 # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
 CONFIG_STACKTRACE=y
-# CONFIG_DEBUG_KOBJECT is not set
+CONFIG_DEBUG_KOBJECT=y
 CONFIG_DEBUG_BUGVERBOSE=y
 CONFIG_DEBUG_INFO=y
-# CONFIG_DEBUG_VM is not set
-# CONFIG_DEBUG_VIRTUAL is not set
+CONFIG_DEBUG_VM=y
+CONFIG_DEBUG_VIRTUAL=y
 # CONFIG_DEBUG_WRITECOUNT is not set
 CONFIG_DEBUG_MEMORY_INIT=y
 CONFIG_DEBUG_LIST=y
-# CONFIG_DEBUG_SG is not set
+CONFIG_DEBUG_SG=y
 CONFIG_DEBUG_NOTIFIERS=y
 # CONFIG_DEBUG_CREDENTIALS is not set
 CONFIG_ARCH_WANT_FRAME_POINTERS=y
@@ -4045,8 +4052,8 @@
 CONFIG_CPU_NOTIFIER_ERROR_INJECT=m
 # CONFIG_FAULT_INJECTION is not set
 CONFIG_LATENCYTOP=y
-# CONFIG_SYSCTL_SYSCALL_CHECK is not set
-# CONFIG_DEBUG_PAGEALLOC is not set
+CONFIG_SYSCTL_SYSCALL_CHECK=y
+CONFIG_DEBUG_PAGEALLOC=y
 CONFIG_USER_STACKTRACE_SUPPORT=y
 CONFIG_HAVE_FUNCTION_TRACER=y
 CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
Comment 13 Alex Villacís Lasso 2010-08-12 09:47:23 UTC
Created attachment 37819 [details]
Initial dmesg with debug kernel

Currently running debug kernel. I got the attached dmesg. Nothing yet about a backtrace. However I notice these weird messages:

[    9.234337] kobject: 'holders': free name
[    9.234340] kobject: 'intel_rng' (ffffffffa000e710): kobject_cleanup
[    9.234341] kobject: 'intel_rng' (ffffffffa000e710): does not have a release() function, it is broken and must be fixed.

[   10.839184] kobject: 'holders': free name
[   10.839186] kobject: 'scsi_wait_scan' (ffffffffa00c7090): kobject_cleanup
[   10.839188] kobject: 'scsi_wait_scan' (ffffffffa00c7090): does not have a release() function, it is broken and must be fixed.

[   14.300151] kobject: 'holders': free name
[   14.300154] kobject: 'microcode' (ffffffffa0128070): kobject_cleanup
[   14.300156] kobject: 'microcode' (ffffffffa0128070): does not have a release() function, it is broken and must be fixed.

They do not seem to be related to my problem, though.
Comment 14 Alex Villacís Lasso 2010-08-17 14:48:25 UTC
Created attachment 37928 [details]
dmesg with debug enabled and "Poison overwritten" backtrace

I got a "BUG kmalloc-512: Poison overwritten" report on 2.6.35. The dmesg report is attached. The i915_* functions appear on the backtrace.
Comment 15 Alex Villacís Lasso 2010-08-30 12:12:50 UTC
Testing since today with 2.6.36-rc3. No additional patches on top of vanilla kernel, but all previous debugging is enabled.
Comment 16 Chris Wilson 2010-09-02 08:35:26 UTC
(In reply to comment #14)
> Created an attachment (id=37928) [details]
> dmesg with debug enabled and "Poison overwritten" backtrace
> 
> I got a "BUG kmalloc-512: Poison overwritten" report on 2.6.35. The dmesg
> report is attached. The i915_* functions appear on the backtrace.

Unrelated to us. c0 93 11 2d 01 88 ff ff would appear to be an ip address and netmask, so the suspicion lies with the br code - please do forward them that stack trace.

And thanks for running with all the extra debugging.
Comment 17 Chris Wilson 2010-09-10 07:49:28 UTC
Still have no clue as to what's happening here.

Can you grab an oops from git://git.kernel.org/pub/scm/linux/kernel/git/ickle/drm-intel.git drm-intel-next so that (a) we verify that it still occurs and (b) so that I can match the disassembly to the source code?
Comment 18 Alex Villacís Lasso 2010-09-13 09:32:22 UTC
Could not (yet) trigger the hang during all tests with 2.6.36-rc3. Testing now with 2.6.36-rc4.
Comment 19 Alex Villacís Lasso 2010-09-17 08:21:29 UTC
Created attachment 38765 [details]
Backtrace with 2.6.36-rc4

Arrrhghh!!! I thought the -rc4 was free from the bug at last!
Comment 20 Chris Wilson 2010-09-22 09:18:24 UTC
I have stared long and hard at all the traces and I simply do not see how we get an invalid pointer deference there. Downgrading the priority since the bug seems to be quite rare.
Comment 21 Alex Villacís Lasso 2010-09-22 10:48:15 UTC
Kernel bug report for poison overwritten at https://bugzilla.kernel.org/show_bug.cgi?id=18842

Now testing for the bug with commit 8b15575cae7a93a784c3005c42b069edd9ba64dd
past 2.6.36-rc5 in mainline kernel... debug enabled as before.
Comment 22 Alex Villacís Lasso 2010-11-12 08:13:57 UTC
Created attachment 40238 [details]
dmesg output with backtrace at the end for recompiled 2.6.36
Comment 23 Alex Villacís Lasso 2010-11-12 08:17:47 UTC
Created attachment 40239 [details]
config file for recompiled 2.6.36 kernel
Comment 24 Alex Villacís Lasso 2010-11-12 08:18:33 UTC
Created attachment 40240 [details]
Xorg output for recompiled 2.6.36
Comment 25 Alex Villacís Lasso 2010-11-12 08:26:24 UTC
For a test, I took the kernel config file for the Fedora 14 x86_64 LiveCD, and ran "make oldconfig" for the vanilla 2.6.36 kernel. Unfortunately this recompiled kernel crashed too, but the messages shown are slightly different. I attached the dmesg with the backtrace, as well as the Xorg.0.log file, and the kernel configuration file after "make oldconfig". I see there are quite a few lines in the Xorg log that read "i830_uxa_prepare_access: bo map failed". In addition, I see about the same number of lines in the dmesg output that say "[drm:i915_gem_create_mmap_offset] *ERROR* failed to add to map hash", around 45 seconds before the crash. Does this add anything new?
Comment 26 Chris Wilson 2010-11-12 08:56:11 UTC
Those errors would be consistent would the earlier crash; both concern some memory corruption with map hash table.

I admit to still being puzzled.
Comment 27 Alex Villacís Lasso 2011-01-10 10:13:46 UTC
Created attachment 41854 [details]
dmesg output with backtrace for 2.6.37

Still present in vanilla 2.6.37.

Is there a way to add some integrity check to drm_ht_insert_item and friends, so that a hash map corruption might be detected as soon as possible? In this way I could correlate the warning with whatever activity is happening at the moment, to narrow down the causes. Better yet, a way to warn/prevent accesses to the memory occupied by the hash by code other than the hash functions.

Bug has now seeped into Fedora 14 stable - see https://bugzilla.redhat.com/show_bug.cgi?id=665887 .
Comment 28 Jesse Barnes 2011-01-17 10:37:31 UTC
Quick locking audit for the hash table.  The offset_hash is touched in drm_gem_init and drm_gem_destroy, both of which only happen at load/unload, so that's not the issue.

drm_gem_mmap, i915_gem_create_mmap_offset, and i915_gem_free_mmap_offset also touch this; the first doing lookups and the latter two doing insert and remove.

drm_gem_mmap takes the struct_mutex around the lookup.

i915_gem_create_mmap_offset also takes the lock through its caller, i915_gem_mmap_gtt_ioctl, so it's also safe.

i915_gem_free_mmap_offset is called by i915_gem_free_object_tail, which does i915_gem_object_unbind, so it had better be locked.  It's called by both i915_gem_retire_requests and i915_gem_free_object.  AFAICT those paths are protected by the struct_mutex as well, so it doesn't appear to be a race.


All the
[267379.215094] [drm:i915_gem_create_mmap_offset] *ERROR* failed to add to map hash
messages are concerning, it means the hash key didn't match, which certainly implies that the h_list got corrupted.  But there were quite a few free code paths, so maybe one really is dropping the lock?

You could try something like this:

--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -1331,6 +1331,8 @@ i915_gem_free_mmap_offset(struct drm_i915_gem_object *obj)
        struct drm_gem_mm *mm = dev->mm_private;
        struct drm_map_list *list = &obj->base.map_list;
 
+       BUG_ON(!mutex_is_locked(&dev->struct_mutex));
+
        drm_ht_remove_item(&mm->offset_hash, &list->hash);
        drm_mm_put_block(list->file_offset_node);
        kfree(list->map);

to confirm that.

It's also possible that hashed_key is bogus (see drm_ht_insert_item), so you could add a bounds check to make sure it's within the size of ht->table[].
Comment 29 Alex Villacís Lasso 2011-01-19 09:39:47 UTC
Created attachment 42203 [details] [review]
Patch to add two BUG_ON on 2.6.38-rc1

This patch contains the changes I made to vanilla 2.6.38-rc1. In addition to the suggested patch in the last comment, I also added a BUG_ON on drm_ht_insert_item with a check for hr->size. I will run 2.6.37-rc1 for as long as possible (until I get a crash, or -rc2 is available).
Comment 30 Alex Villacís Lasso 2011-01-21 14:52:31 UTC
Created attachment 42292 [details]
dmesg output with messages for 2.6.38-rc1

The message "[drm:i915_gem_create_mmap_offset] *ERROR* failed to add to map hash" appeared again, but now the failure mode is different. With 2.6.38-rc1 my entire graphical session crashed when I tried to scroll a listbox in Eclipse, but then the machine bounced back to the graphical login without hanging. The attached file contains the message log for the kernel. Notice that neither WARN_ON triggered in any moment. There were two separate episodes of message flood. The first one was from 84909.359537 to 84909.364938 and the second one was from 89647.299393 to 89649.534727 .
Comment 31 Alex Villacís Lasso 2011-01-21 14:53:28 UTC
Created attachment 42293 [details]
Xorg.0.log right after the crash on 2.6.38-rc1 .
Comment 32 Alex Villacís Lasso 2011-01-25 12:40:37 UTC
Created attachment 42485 [details]
dmesg output with messages for 2.6.38-rc2

Bug still present in 2.6.38-rc2. 

Now the crash has changed slightly. This time the message starts with:

[26443.152330] BUG: unable to handle kernel paging request at 0000100000000000
[26443.152393] IP: [<ffffffffa002dc80>] drm_ht_insert_item+0x62/0xb7 [drm]

As before, neither WARN_ON triggered before this crash.

From what little I can understand from reading the code, the message "...failed to add to map hash..." should only trigger when attempting to insert a duplicate key into a hash. I also gather that the "key" is actually a memory address returned as part of a structure from drm_mm_get_block_generic(). So the memory allocator is either returning the same address to multiple allocation requests, or the hash table is not being cleared of no-longer-used memory addresses. I am thinking about adding trace messages to dump the contents of the hashmap and trace messages to drm_ht_insert_item to see where in the implementation the crash occurs.

BTW, 0x0000100000000000 seems an invalid address, and the previous analysis does not tell me how the code gets to reference such an address.
Comment 33 Chris Wilson 2011-01-25 12:51:00 UTC
Did you ever check your machine with memtest86? But if it is bad memory then it is being incredibly specific all this time... I still don't understand how you appear to be the only person who suffers from this bug considering the apparent severity.
Comment 34 Alex Villacís Lasso 2011-01-25 14:39:10 UTC
(In reply to comment #33)
> Did you ever check your machine with memtest86? But if it is bad memory then it
> is being incredibly specific all this time... I still don't understand how you
> appear to be the only person who suffers from this bug considering the apparent
> severity.

Yes, I have checked with memtest. No hints of bad memory.

No, I am not the only person who has seen this bug. See https://bugzilla.redhat.com/show_bug.cgi?id=665887 comment 3.
Comment 35 Alex Villacís Lasso 2011-01-25 14:49:58 UTC
(In reply to comment #34)
> (In reply to comment #33)
> > Did you ever check your machine with memtest86? But if it is bad memory then it
> > is being incredibly specific all this time... I still don't understand how you
> > appear to be the only person who suffers from this bug considering the apparent
> > severity.
> 
> Yes, I have checked with memtest. No hints of bad memory.
> 
> No, I am not the only person who has seen this bug. See
> https://bugzilla.redhat.com/show_bug.cgi?id=665887 comment 3.

Also, these reports look like the same bug:
https://lists.linux-foundation.org/pipermail/bugme-new/2010-December/026319.html
http://kerneltrap.org/mailarchive/linux-kernel/2010/7/8/4591541
http://ubuntuforums.org/showthread.php?t=1385321
Comment 36 Alex Villacís Lasso 2011-01-26 07:58:04 UTC
Created attachment 42526 [details]
dmesg output with messages for 2.6.38-rc2 - new crash

I got another crash (with mutter/gnome-shell, apparently when hovering mouse pointer over thunderbird and displaying a tooltip) on 2.6.38-rc2. However this one is different from the previous ones.
Comment 37 Alex Villacís Lasso 2011-01-26 09:18:52 UTC
Created attachment 42536 [details] [review]
Patch to add additional tracing to DRM hash table functions

In order to assist in locating the bug, I created this patch. This patch is supposed to dump the state of the hash bucket for the selected key, for drm_ht_insert_item, drm_ht_remove*, and so on. After recompiling, I booted with drm.debug=1 and will work normally until a hang occurs.

BTW, what is the purpose of the 'fill' struct field of 'struct drm_open_hash' ? The remove functions decrement this variable, but I see no corresponding increment, and no other references to this variable in the drm code.
Comment 38 Alex Villacís Lasso 2011-01-26 09:20:05 UTC
Created attachment 42537 [details]
dmesg output with trace patch - "normal" case

Baseline situation with machine working "normally", and trace patch active.
Comment 39 Chris Wilson 2011-01-26 10:47:39 UTC
One thing to try is whether

diff --git a/drivers/gpu/drm/drm_gem.c b/drivers/gpu/drm/drm_gem.c
index ea1c4b0..cd54e67 100644
--- a/drivers/gpu/drm/drm_gem.c
+++ b/drivers/gpu/drm/drm_gem.c
@@ -101,7 +101,7 @@ drm_gem_init(struct drm_device *dev)
 
        dev->mm_private = mm;
 
-       if (drm_ht_create(&mm->offset_hash, 19)) {
+       if (drm_ht_create(&mm->offset_hash, 9)) {
                kfree(mm);
                return -ENOMEM;
        }

has any affect on the frequency of this bug.
Comment 40 Alex Villacís Lasso 2011-02-02 11:18:59 UTC
Created attachment 42864 [details] [review]
A different version of the trace patch

This version of the patch makes drm_ht_verbose_list() and the traces use DRM_INFO instead of DRM_DEBUG for the messages. This is required as there is quite a bit of output, and drm.debug=1 causes mangled output in /var/log/messages, apparently because it is generated way faster than what syslog can handle.
Comment 41 Alex Villacís Lasso 2011-02-02 11:37:14 UTC
I have news on this bug.

I am now testing on 2.6.38-rc3, and I have witnessed an interesting hash map corruption by means of the latest trace patch:

After running for one day and leaving Vuze running overnight, I resumed work today, and was alerted of a WARN_ON from my patch. At time 73152.321693, drm tried to insert key 0x001306ec which hashed to 0x00000057. At that time there were 4 items in the hash bucket:

Feb  2 11:57:18 srv64 kernel: [73152.321693] [drm] drm_ht_insert_item: before insert, fill=4294684641
Feb  2 11:57:18 srv64 kernel: [73152.321697] [drm] Key is 0x001306ec, Hashed key is 0x00000057
Feb  2 11:57:18 srv64 kernel: [73152.321702] [drm] count 0, key: 0x001260d1 list: 0xffff880037a10828 list->next: 0xffff880037af9428
Feb  2 11:57:18 srv64 kernel: [73152.321706] [drm] count 1, key: 0x0012ae80 list: 0xffff880037af9428 list->next: 0xffff88000434ce28
Feb  2 11:57:18 srv64 kernel: [73152.321710] [drm] count 2, key: 0x0012bcbf list: 0xffff88000434ce28 list->next: 0xffff88011bbbb028
Feb  2 11:57:18 srv64 kernel: [73152.321714] [drm] count 3, key: 0x00130e80 list: 0xffff88011bbbb028 list->next: 0x          (null)
Feb  2 11:57:18 srv64 kernel: [73152.321717] [drm] drm_ht_insert_item: after insert, fill=4294684641
Feb  2 11:57:18 srv64 kernel: [73152.321720] [drm] Key is 0x001306ec, Hashed key is 0x00000057
Feb  2 11:57:18 srv64 kernel: [73152.321724] [drm] count 0, key: 0x001260d1 list: 0xffff880037a10828 list->next: 0xffff880037af9428
Feb  2 11:57:18 srv64 kernel: [73152.321728] [drm] count 1, key: 0x0012ae80 list: 0xffff880037af9428 list->next: 0xffff88000434ce28
Feb  2 11:57:18 srv64 kernel: [73152.321732] [drm] count 2, key: 0x0012bcbf list: 0xffff88000434ce28 list->next: 0xffff8800378e6a28
Feb  2 11:57:18 srv64 kernel: [73152.321737] [drm] count 3, key: 0x001306ec list: 0xffff8800378e6a28 list->next: 0xffff88011bbbb028
Feb  2 11:57:18 srv64 kernel: [73152.321741] [drm] count 4, key: 0x00130e80 list: 0xffff88011bbbb028 list->next: 0x          (null)


Later, at 73206.237985, drm tried to remove key 0x001306ec. However, the linked list was now corrupted:

Feb  2 11:58:12 srv64 kernel: [73206.237985] [drm] drm_ht_remove_item: before remove, fill=4294683133
Feb  2 11:58:12 srv64 kernel: [73206.237987] [drm] Key is 0x001306ec, Hashed key is 0x00000057
Feb  2 11:58:12 srv64 kernel: [73206.237990] [drm] count 0, key: 0x001260d1 list: 0xffff880037a10828 list->next: 0xffff880037af9428
Feb  2 11:58:12 srv64 kernel: [73206.237992] [drm] count 1, key: 0x0012ae80 list: 0xffff880037af9428 list->next: 0xffff88000c7abc20
Feb  2 11:58:12 srv64 kernel: [73206.237994] [drm] count 2, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb  2 11:58:12 srv64 kernel: [73206.237997] [drm] count 3, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb  2 11:58:12 srv64 kernel: [73206.237999] [drm] count 4, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb  2 11:58:12 srv64 kernel: [73206.238005] [drm] count 5, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb  2 11:58:12 srv64 kernel: [73206.238008] [drm] count 6, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb  2 11:58:12 srv64 kernel: [73206.238010] [drm] count 7, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb  2 11:58:12 srv64 kernel: [73206.238013] [drm] count 8, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb  2 11:58:12 srv64 kernel: [73206.238015] [drm] count 9, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb  2 11:58:12 srv64 kernel: [73206.238018] [drm] count 10, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb  2 11:58:12 srv64 kernel: [73206.238021] [drm] count 11, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb  2 11:58:12 srv64 kernel: [73206.238023] [drm] count 12, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb  2 11:58:12 srv64 kernel: [73206.238026] [drm] count 13, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb  2 11:58:12 srv64 kernel: [73206.238028] [drm] count 14, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
Feb  2 11:58:12 srv64 kernel: [73206.238030] [drm] count 15, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
Feb  2 11:58:12 srv64 kernel: [73206.238032] ------------[ cut here ]------------
Feb  2 11:58:12 srv64 kernel: [73206.238061] WARNING: at drivers/gpu/drm/drm_hashtab.c:81 drm_ht_verbose_list+0x59/0x96 [drm]()
Feb  2 11:58:12 srv64 kernel: [73206.238063] Hardware name:         
Feb  2 11:58:12 srv64 kernel: [73206.238065] Modules linked in: tcp_lp bluetooth rfkill fuse ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat bridge stp llc deflate zlib_deflate ctr camellia cast5 rmd160 crypto_null ccm serpent blowfish twofish_x86_64 twofish_common ecb xcbc cbc sha256_generic sha512_generic des_generic cryptd aes_x86_64 aes_generic ah6 ah4 esp6 esp4 xfrm4_mode_beet xfrm4_tunnel tunnel4 xfrm4_mode_tunnel xfrm4_mode_transport xfrm6_mode_transport xfrm6_mode_ro xfrm6_mode_beet xfrm6_mode_tunnel ipcomp ipcomp6 xfrm_ipcomp xfrm6_tunnel tunnel6 af_key vboxnetadp vboxnetflt vboxdrv sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf capi capifs kernelcapi ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 kvm_intel kvm uinput snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer ppdev parport_pc snd iTCO_wdt iTCO_vendor_support soundcore snd_page_alloc r8169 pcspkr parport i2c_i801 mii microcode i915 drm_kms_helper drm i
Feb  2 11:58:12 srv64 kernel: 2c_algo_bit i2c_core video [last unloaded: scsi_wait_scan]
Feb  2 11:58:12 srv64 kernel: [73206.238138] Pid: 1975, comm: Xorg Not tainted 2.6.38-rc3 #5
Feb  2 11:58:12 srv64 kernel: [73206.238139] Call Trace:
Feb  2 11:58:12 srv64 kernel: [73206.238146]  [<ffffffff810515ef>] ? warn_slowpath_common+0x85/0x9d
Feb  2 11:58:12 srv64 kernel: [73206.238150]  [<ffffffff81051621>] ? warn_slowpath_null+0x1a/0x1c
Feb  2 11:58:12 srv64 kernel: [73206.238159]  [<ffffffffa002dcf4>] ? drm_ht_verbose_list+0x59/0x96 [drm]
Feb  2 11:58:12 srv64 kernel: [73206.238164]  [<ffffffff811212cd>] ? lookup_page_cgroup+0x36/0x4c
Feb  2 11:58:12 srv64 kernel: [73206.238174]  [<ffffffffa002dd69>] ? drm_ht_remove_item+0x38/0x8e [drm]
Feb  2 11:58:12 srv64 kernel: [73206.238177]  [<ffffffff811212cd>] ? lookup_page_cgroup+0x36/0x4c
Feb  2 11:58:12 srv64 kernel: [73206.238191]  [<ffffffffa007a61b>] ? i915_gem_free_object_tail+0x78/0xd3 [i915]
Feb  2 11:58:12 srv64 kernel: [73206.238199]  [<ffffffffa002707e>] ? drm_gem_object_free+0x0/0x2d [drm]
Feb  2 11:58:12 srv64 kernel: [73206.238210]  [<ffffffffa007c7f3>] ? i915_gem_free_object+0x80/0x88 [i915]
Feb  2 11:58:12 srv64 kernel: [73206.238218]  [<ffffffffa002707e>] ? drm_gem_object_free+0x0/0x2d [drm]
Feb  2 11:58:12 srv64 kernel: [73206.238226]  [<ffffffffa00270a9>] ? drm_gem_object_free+0x2b/0x2d [drm]
Feb  2 11:58:12 srv64 kernel: [73206.238230]  [<ffffffff81231e5e>] ? kref_put+0x43/0x4d
Feb  2 11:58:12 srv64 kernel: [73206.238238]  [<ffffffffa0026fad>] ? drm_gem_object_unreference_unlocked+0x33/0x40 [drm]
Feb  2 11:58:12 srv64 kernel: [73206.238246]  [<ffffffffa002705a>] ? drm_gem_object_handle_unreference_unlocked+0x35/0x39 [drm]
Feb  2 11:58:12 srv64 kernel: [73206.238255]  [<ffffffffa00273ef>] ? drm_gem_close_ioctl+0x79/0x84 [drm]
Feb  2 11:58:12 srv64 kernel: [73206.238263]  [<ffffffffa00258b8>] ? drm_ioctl+0x2a8/0x385 [drm]
Feb  2 11:58:12 srv64 kernel: [73206.238267]  [<ffffffff811f9763>] ? inode_has_perm.clone.19+0x76/0x8c
Feb  2 11:58:12 srv64 kernel: [73206.238275]  [<ffffffffa0027376>] ? drm_gem_close_ioctl+0x0/0x84 [drm]
Feb  2 11:58:12 srv64 kernel: [73206.238279]  [<ffffffff81048a9a>] ? get_parent_ip+0x11/0x41
Feb  2 11:58:12 srv64 kernel: [73206.238282]  [<ffffffff81048a9a>] ? get_parent_ip+0x11/0x41
Feb  2 11:58:12 srv64 kernel: [73206.238286]  [<ffffffff81481005>] ? sub_preempt_count+0x97/0xaa
Feb  2 11:58:12 srv64 kernel: [73206.238290]  [<ffffffff81113665>] ? __bit_spin_unlock.clone.42+0x2c/0x37
Feb  2 11:58:12 srv64 kernel: [73206.238293]  [<ffffffff811f981e>] ? file_has_perm+0xa5/0xc7
Feb  2 11:58:12 srv64 kernel: [73206.238296]  [<ffffffff811333a7>] ? do_vfs_ioctl+0x4a4/0x4e5
Feb  2 11:58:12 srv64 kernel: [73206.238300]  [<ffffffff8113343e>] ? sys_ioctl+0x56/0x7c
Feb  2 11:58:12 srv64 kernel: [73206.238303]  [<ffffffff8100ac82>] ? system_call_fastpath+0x16/0x1b
Feb  2 11:58:12 srv64 kernel: [73206.238306] ---[ end trace d10b7b5570a59e52 ]---


Notice the weird pattern:
[drm] count 0, key: 0x001260d1 list: 0xffff880037a10828 list->next: 0xffff880037af9428
[drm] count 1, key: 0x0012ae80 list: 0xffff880037af9428 list->next: 0xffff88000c7abc20
[drm] count 2, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20
[drm] count 3, key: 0xffff8801365166c0 list: 0xffff88000c7abe20 list->next: 0xffff88000c7abc20
[drm] count 4, key: 0xffff8801365166c0 list: 0xffff88000c7abc20 list->next: 0xffff88000c7abe20

The node at count 1 got its "next" pointer corrupted with the value 0xffff88000c7abc20, which then points to memory that, when interpreted as nodes, decodes to two nodes that happen to point to each other in a loop. The trace patch detects this and prints a WARN_ON. I have the full log available (74 MB after gzip), but I think the bugtracker will not accept it.

The weirdest thing is that the machine is still running as I type this - no hangs (yet).

What do you think about this?
Comment 42 Alex Villacís Lasso 2011-02-02 11:41:10 UTC
Full messages being uploaded at:

http://srv25.palosanto.com/palosanto.com/~a_villacis/kernel-messages-2.6.38-rc3.txt.gz

Complete file size is 76559430 bytes.
Comment 43 Chris Wilson 2011-02-02 14:03:29 UTC
It means something else scribbled over a live object in that 50s interval whilst your back was turned. Use-after-free somewhere else in the kernel.
Comment 44 Chris Wilson 2011-02-02 14:04:47 UTC
Created attachment 42872 [details] [review]
Slab allocate objects.

So if I am right in that some other layer is scribbling over old memory, then allocating the objects from a separate slab should prevent that...
Comment 45 Alex Villacís Lasso 2011-02-02 15:48:41 UTC
(In reply to comment #44)
> Created an attachment (id=42872) [details]
> Slab allocate objects.
> 
> So if I am right in that some other layer is scribbling over old memory, then
> allocating the objects from a separate slab should prevent that...

By "some other layer", do you mean somewhere higher up in the intel driver? The corruption happened while working in the graphics session in gnome-shell. Besides that clue, is there a hint in the corrupted values that points specifically at the files you touch with the patch?
Comment 46 Chris Wilson 2011-02-02 15:53:56 UTC
Actually, I meant some other bit of kernel code that shares the same kmalloc cache as struct drm_i915_gem_object.

That debug log does not reveal where the corruption came from, but does tell us that it has nothing to do with the mmap_offset ht.
Comment 47 Alex Villacís Lasso 2011-02-02 16:23:45 UTC
(In reply to comment #46)
> Actually, I meant some other bit of kernel code that shares the same kmalloc
> cache as struct drm_i915_gem_object.
> 
> That debug log does not reveal where the corruption came from, but does tell us
> that it has nothing to do with the mmap_offset ht.

Then, is there a way to query which portions of the kernel share(d) allocations with struct drm_i915_gem_object ? This would narrow down the suspects.
Comment 48 Chris Wilson 2011-02-03 01:56:29 UTC
Turn on slab debugging. That should give you some directories under /sys/kernel/debug/ that describe the objects allocated from each of the caches (including the callers).
Comment 49 Alex Villacís Lasso 2011-02-03 07:54:18 UTC
(In reply to comment #48)
> Turn on slab debugging. That should give you some directories under
> /sys/kernel/debug/ that describe the objects allocated from each of the caches
> (including the callers).

I rebooted with "slub_debug=FZP", but I still get nothing under /sys/kernel/debug/ Was this the right thing to do?
Comment 50 Alex Villacís Lasso 2011-02-04 16:16:56 UTC
Created attachment 42956 [details]
Crash after slab-allocate patch

After running a day-long session with gnome-shell, I logged out of the session, then logged in again. After the desktop finished loading, I tried moving the mouse, only to find that the session froze. After a minute or so, the PC beeped but remained with a frozen display. I sshd into the machine and got the backtrace attached here. Now it is a different place and call stack...

Again I ask, was the slub_debug  kernel parameter enough, or should I do something else, like recompiling the kernel with some debug option?
Comment 51 Alex Villacís Lasso 2011-02-09 08:01:14 UTC
Created attachment 43166 [details]
slabinfo -v report for vanilla 2.6.38-rc4 kernel

Currently running 2.6.38-rc4 with slub_debug=FZP . This morning, after leaving the computer on all night, I ran slabinfo -v. The attached report appeared in /var/log/messages. I ran slabinfo -v a few times, and collected various reports attached in the file.
Comment 52 Alex Villacís Lasso 2011-02-14 08:29:21 UTC
Created attachment 43340 [details]
Another memory corruption report with i915 in separate slab

I recompiled the kernel with the patch to allocate objects in a separate slab, and rebooted with slub_debug=FZPU . After a weekend of running Vuze the bittorrent client, I found the attached memory corruptions. The very first one in the file was found without me using any commands. The rest were found with slabinfo -v.

Is this enough evidence to conclude that some network code is corrupting memory? Or could this be just the network data being corrupted in the place of i915?
Comment 53 Chris Wilson 2011-02-14 08:34:27 UTC
What's your ipaddr/netmask?
Comment 54 Chris Wilson 2011-02-14 08:37:27 UTC
Only the first instance looks like an ipaddr/netmask. The other look like posion values. Still consistent with a use-after-free, with a netdev suspect.
Comment 55 Alex Villacís Lasso 2011-02-14 08:56:14 UTC
(In reply to comment #53)
> What's your ipaddr/netmask?

[alex@srv64 linux-2.6.38-rc4]$ /sbin/ifconfig
eth0      Link encap:Ethernet  HWaddr 00:27:0E:02:59:2B  
          inet addr:192.168.3.2  Bcast:192.168.3.255  Mask:255.255.255.0
          inet6 addr: fe80::227:eff:fe02:592b/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:19917 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5932 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:5644305 (5.3 MiB)  TX bytes:1018994 (995.1 KiB)
          Interrupt:44 Base address:0xe000 

lo        Link encap:Local Loopback  
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:16436  Metric:1
          RX packets:215 errors:0 dropped:0 overruns:0 frame:0
          TX packets:215 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:10484 (10.2 KiB)  TX bytes:10484 (10.2 KiB)

virbr0    Link encap:Ethernet  HWaddr E6:66:F3:14:5A:EA  
          inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:176 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0 
          RX bytes:0 (0.0 b)  TX bytes:11842 (11.5 KiB)
Comment 56 Harshula Jayasuriya 2011-03-04 04:30:18 UTC
Hi Alex, are you connected to a VPN when the oops happens? I've seen this twice recently on 2.6.35.11-83.fc14.x86_64.
Comment 57 Alex Villacís Lasso 2011-03-09 08:36:54 UTC
(In reply to comment #56)
> Hi Alex, are you connected to a VPN when the oops happens? I've seen this twice
> recently on 2.6.35.11-83.fc14.x86_64.

No. However, I have frequently seen the error shortly after using the graphical session after leaving Azureus (the bittorrent client) running all night.
Comment 58 Alex Villacís Lasso 2011-03-18 11:20:37 UTC
I have upgraded from 2.6.38-rc4 to 2.6.38-rc8, and then to 2.6.38 stable, and I have not seen the graphics hang happening. I ran a whole week with -rc8 and two days (and counting) with stable. I must add, I removed the slab-allocate patch with no changes in stability (in 2.6.38 stable). However I still have slub_debug=FZPU in my kernel commandline. Periodic checking with slabinfo -v shows no corruption.
Comment 59 Chris Wilson 2011-03-18 12:00:01 UTC
If you were a masochist, you could bisect and see where the bug lay.

Hope springs eternal...


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.