Bug 26521

Summary: Bug in kernel module - ttm_bo_pci_offset
Product: xorg
Reporter: Alexander <alex.vizor>
Component: Driver/nouveau
Assignee: Nouveau Project <nouveau>
Status: RESOLVED FIXED
QA Contact: Xorg Project Team <xorg-team>
Severity: normal
Priority: medium
CC: bram, brian, hramrach
Version: 7.4 (2008.09)
Hardware: x86-64 (AMD64)
OS: Linux (All)
Attachments:
- kernel log (attachment 33225)
- kernel log (attachment 33354)
- dmesg log (attachment 34299)
- crash in nouveau_ttm_io_mem_reserve (2.6.35-rc2+) (attachment 36181)
- workaround (attachment 36585)
- alternative fix by Francisco Jerez (attachment 36734)

Description Alexander 2010-02-10 12:52:31 UTC
Created attachment 33225
kernel log

On 2.6.33-rcX, s2disk works only some of the time. When I tested it as described in the kernel documentation, it worked every time; I tried about 9 times, with and without various debugging options. But after I had used the laptop for half an hour and then tried s2disk, it hung while suspending, after the message "Suspending console" or something like it. One more note about testing: I tested both immediately after booting into single-user mode and immediately after starting the X server.

I got the last kernel log, with the stack trace and NULL pointer error, after resuming from s2ram.
Comment 1 Alexander 2010-02-10 12:53:49 UTC
I forgot to mention: more information can be found at http://bugzilla.kernel.org/show_bug.cgi?id=15120
Comment 2 Alexander 2010-02-16 05:51:44 UTC
Did anybody notice this report?
Comment 3 Alexander 2010-02-17 01:17:23 UTC
Created attachment 33354
kernel log

This is the kernel log of the bug, reproduced today on a 2.6.33-rc8 kernel compiled with full debugging.
Comment 4 Alexander 2010-02-17 01:39:07 UTC
About the "full debugging" kernel in my last post: I compiled the kernel with these options
- Compile the kernel with debug info
- Compile the kernel with frame pointers
from the "Kernel hacking" section in menuconfig.

Before 2.6.33-rcX I used the nvidia blob; it always worked well with suspend to RAM but never resumed from suspend to disk. I didn't use nouveau before 2.6.33-rcX, so I can't say whether it worked for me earlier.

Starting with 2.6.33-rcX I'm using the nouveau driver included in the kernel, and I have some trouble with suspending and resuming (both to RAM and to disk).

About suspend to disk: I tested it as described in the kernel documentation, immediately after system start, both without X and with it, with different debug options, and everything worked well. Then I worked for an hour, tried to suspend to disk, and the laptop hung. I'm using KDE with effects through XRender; maybe that has an effect.

About suspend to RAM: I don't know how to reproduce this bug. It appears randomly, about one time in ten, and that is really all I can say about it; the rest is in the kernel log :)
Comment 5 Alexander 2010-02-23 09:42:25 UTC
So, any ideas for a fix? Maybe I should provide more information?
Comment 6 Maarten Maathuis 2010-02-23 10:00:46 UTC
It would help to know the specific code that is causing it. gdb's 'info line' is your friend; you will need the ttm.ko module or the vmlinux kernel image (if it's built in). A short code snippet to avoid confusion would be nice too (all development happens against a special kernel tree).
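
As a rough sketch of that workflow, the short Python helper below turns an oops line into a gdb invocation. The function name gdb_info_line and the regular expression are illustrative assumptions, not part of any kernel tooling; the sample line is the IP line from the 2.6.35-rc2+ trace quoted later in this report. It assumes the module file is named after the bracketed module name and sits in the current directory, and that it was built with debug info, otherwise gdb cannot resolve the address to a source line.

```python
import re

# Matches an oops line such as:
#   IP: [<ffffffffa00a4faf>] nouveau_ttm_io_mem_reserve+0x8f/0xba [nouveau]
# The trailing "[module]" part is absent for code built into vmlinux.
OOPS_SYMBOL = re.compile(
    r"\[<[0-9a-f]+>\]\s+"
    r"(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>[\w-]+)\])?"
)


def gdb_info_line(oops_line):
    """Build a gdb command that maps symbol+offset to a source line."""
    match = OOPS_SYMBOL.search(oops_line)
    if match is None:
        raise ValueError("no symbol+offset found in line")
    module = match.group("module")
    # Assumption: modules are available as <name>.ko, built-in code as vmlinux.
    target = module + ".ko" if module else "vmlinux"
    return "gdb -batch -ex 'info line *({}+{})' {}".format(
        match.group("symbol"), match.group("offset"), target
    )


sample = "IP: [<ffffffffa00a4faf>] nouveau_ttm_io_mem_reserve+0x8f/0xba [nouveau]"
print(gdb_info_line(sample))
```

Running the printed command in the directory that holds the module should report the source file and line for the faulting instruction; replacing 'info line' with 'list' in the same command prints the surrounding code, which gives the short snippet asked for above.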
Comment 7 Michal Suchanek 2010-03-21 09:15:28 UTC
Created attachment 34299
dmesg log

I got a very similar error on 2.6.34, though with a slightly different trace.

Unfortunately I have not been able to reproduce it since I built a debug kernel.
Comment 8 Marcin Slusarz 2010-06-08 13:57:20 UTC
It seems the ttm code was heavily changed during 2.6.35 development, and the function we crash in (ttm_bo_pci_offset) was removed.

We probably need a new trace...
Comment 9 Alexander 2010-06-09 00:52:30 UTC
I've tested 2.6.35-rc1 and -rc2 and it looks like it works. At least I haven't hit this error in the last two weeks.
Comment 10 Marcin Slusarz 2010-06-09 13:20:58 UTC
It doesn't:

PM: resume of devices complete after 5330.459 msecs
[drm] nouveau 0000:02:00.0: PGRAPH_DATA_ERROR - Ch 1/0 Class 0x5039 Mthd 0x0318 Data 0x00001000:0x00084814
[drm] nouveau 0000:02:00.0: PGRAPH_DATA_ERROR - INVALID_VALUE
BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffffa00a4faf>] nouveau_ttm_io_mem_reserve+0x8f/0xba [nouveau]
PGD 1bf29d067 PUD 1bf290067 PMD 0 
CPU 3 
Modules linked in: nouveau ttm drm_kms_helper snd_hda_codec_realtek snd_hda_intel snd_hda_codec

Pid: 2546, comm: X Not tainted 2.6.35-rc2+ #362 P6T SE/System Product Name
RIP: 0010:[<ffffffffa00a4faf>]  [<ffffffffa00a4faf>] nouveau_ttm_io_mem_reserve+0x8f/0xba [nouveau]
RSP: 0018:ffff8801be7cf908  EFLAGS: 00010246
RAX: ffff8801be5eb000 RBX: ffff8801be7cfb58 RCX: ffff8801b7180000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8801b7180118
RBP: ffff8801be7cf918 R08: 800000000000016b R09: 0000000000000001
R10: 00000000012e0001 R11: ffff8800d12ee000 R12: ffff8801b7180118
R13: 0000000000000000 R14: ffff8801be7cfa18 R15: 0000000000000000
FS:  00007f857faf4840(0000) GS:ffff8800026c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 00000001be7a8000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process X (pid: 2546, threadinfo ffff8801be7ce000, task ffff8801bdc69f20)
 ffff8801be7cfa20 ffff8801be7cfb58 ffff8801be7cf928 ffffffffa0084761
<0> 0000000000000000 ffff8801b7180118 ffff8801be7cfa58 ffffffffa0084f18
 [<ffffffffa0084761>] ttm_mem_io_reserve+0x17/0x19 [ttm]
 [<ffffffffa008482f>] ttm_mem_reg_ioremap+0x20/0x81 [ttm]
 [<ffffffffa0084f18>] ttm_bo_move_memcpy+0x6e/0x40a [ttm]
 [<ffffffffa00a64f8>] ? nouveau_fence_wait+0x4f/0xaf [nouveau]
 [<ffffffffa00a630f>] ? nouveau_fence_unref+0x24/0x2f [nouveau]
 [<ffffffffa0082caf>] ? ttm_bo_reserve+0x33/0xf3 [ttm]
 [<ffffffffa00a5a9a>] nouveau_bo_move+0x183/0x207 [nouveau]
 [<ffffffffa0082121>] ttm_bo_handle_move_mem+0x1b4/0x2c0 [ttm]
 [<ffffffffa0083fe6>] ttm_bo_move_buffer+0xe6/0x13f [ttm]
 [<ffffffffa00840ea>] ttm_bo_validate+0xab/0xf4 [ttm]
 [<ffffffffa00a6c3a>] validate_list+0x145/0x28e [nouveau]
 [<ffffffffa00a7dde>] nouveau_gem_ioctl_pushbuf+0xf0d/0xf39 [nouveau]
 [<ffffffff8129a4e2>] drm_ioctl+0x27b/0x347
 [<ffffffffa00a6ed1>] ? nouveau_gem_ioctl_pushbuf+0x0/0xf39 [nouveau]
 [<ffffffff8110c001>] vfs_ioctl+0x2d/0xa1
 [<ffffffff8110c54d>] do_vfs_ioctl+0x454/0x48d
 [<ffffffff810fef91>] ? fget_light+0xa4/0x289
 [<ffffffff8110c5c8>] sys_ioctl+0x42/0x65
 [<ffffffff8102fd6b>] system_call_fastpath+0x16/0x1b
RIP  [<ffffffffa00a4faf>] nouveau_ttm_io_mem_reserve+0x8f/0xba [nouveau]
 RSP <ffff8801be7cf908>
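
To read the call path out of a trace like the one above, it can help to drop the frames marked with "?", which the kernel prints for stack entries it could not verify. The following is a minimal Python sketch of that filtering; the name reliable_frames and the regular expression are assumptions made for illustration, and the excerpt is copied from the trace above.

```python
import re

# One stack frame: "[<address>] [? ]symbol+0xoff/0xsize [module]".
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<guess>\?\s+)?"
    r"(?P<symbol>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<module>[\w-]+)\])?"
)


def reliable_frames(trace_text):
    """Return (symbol, module) pairs for frames not marked with '?'."""
    frames = []
    for line in trace_text.splitlines():
        match = FRAME.search(line)
        if match and not match.group("guess"):
            # Frames without a "[module]" suffix belong to the core kernel.
            frames.append((match.group("symbol"), match.group("module") or "kernel"))
    return frames


excerpt = """\
 [<ffffffffa0084761>] ttm_mem_io_reserve+0x17/0x19 [ttm]
 [<ffffffffa008482f>] ttm_mem_reg_ioremap+0x20/0x81 [ttm]
 [<ffffffffa0084f18>] ttm_bo_move_memcpy+0x6e/0x40a [ttm]
 [<ffffffffa00a64f8>] ? nouveau_fence_wait+0x4f/0xaf [nouveau]
 [<ffffffffa00a5a9a>] nouveau_bo_move+0x183/0x207 [nouveau]
"""

for symbol, module in reliable_frames(excerpt):
    print(symbol, module)
```

For this excerpt the innermost-to-outermost chain is ttm_mem_io_reserve, ttm_mem_reg_ioremap, ttm_bo_move_memcpy, nouveau_bo_move, with the nouveau_fence_wait entry filtered out.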
Comment 11 Marcin Slusarz 2010-06-09 13:22:24 UTC
Created attachment 36181
crash in nouveau_ttm_io_mem_reserve (2.6.35-rc2+)
Comment 12 Marcin Slusarz 2010-06-13 14:11:16 UTC
For some reason I can reliably reproduce this bug, so I've added some BUG_ONs/WARN_ONs, which revealed that:

- it crashes because mem->mm_node is NULL (at least when mem->mem_type == TTM_PL_VRAM)
- it comes through the 2nd ttm_mem_reg_ioremap in ttm_bo_move_memcpy, so it is new_mem whose mm_node == NULL
- it comes through the 2nd ttm_bo_move_memcpy in nouveau_bo_move, so some earlier nouveau_bo_move_* failed, which can only happen in case of some hardware failure (a lockup or something)
Comment 13 Marcin Slusarz 2010-06-28 13:57:09 UTC
Created attachment 36585
workaround

It "fixes" the oops but doesn't address its cause: a GPU lockup.
Any other ideas on how to fix this?
Comment 14 Marcin Slusarz 2010-06-28 14:13:18 UTC
*** Bug 27574 has been marked as a duplicate of this bug. ***
Comment 15 Marcin Slusarz 2010-07-04 03:17:40 UTC
Created attachment 36734
alternative fix by Francisco Jerez
Comment 16 Marcin Slusarz 2010-07-06 08:58:35 UTC
Fixed by commit 8e530bd62f9eff097caf8e8546fa85f865928b78 "drm/nouveau: Move the fence wait before migration resource clean-up."
Comment 17 Marcin Slusarz 2010-07-13 08:29:05 UTC
*** Bug 29039 has been marked as a duplicate of this bug. ***
Comment 18 Brian Tarricone 2010-11-20 02:48:00 UTC
*** Bug 27036 has been marked as a duplicate of this bug. ***
