Summary: | Bug in kernel module - ttm_bo_pci_offset | |
---|---|---|---
Product: | xorg | Reporter: | Alexander <alex.vizor>
Component: | Driver/nouveau | Assignee: | Nouveau Project <nouveau>
Status: | RESOLVED FIXED | QA Contact: | Xorg Project Team <xorg-team>
Severity: | normal | |
Priority: | medium | CC: | bram, brian, hramrach
Version: | 7.4 (2008.09) | |
Hardware: | x86-64 (AMD64) | |
OS: | Linux (All) | |

I forgot to mention: more information can be found at http://bugzilla.kernel.org/show_bug.cgi?id=15120

Did anybody notice this report?

Created attachment 33354 [details]
kernel log
This is the kernel log of the bug, reproduced today on a 2.6.33-rc8 kernel compiled with full debugging.

About the full-debug kernel in my last post: I compiled the kernel with the following options from the "Kernel hacking" section in menuconfig:
- Compile the kernel with debug info
- Compile the kernel with frame pointers

Before 2.6.33-rcX I used the nvidia blob; it always worked well with suspend to RAM and never resumed from suspend to disk. I didn't use nouveau before 2.6.33-rcX, so I can't say whether it worked for me earlier. Starting from 2.6.33-rcX I'm using the nouveau driver included in the kernel, and I have some trouble with suspending and resuming (both to RAM and to disk).

About suspend to disk: I tested it as described in the kernel documentation, immediately after system start, without X and with it, with different debug options, and everything worked just fine. Then I worked for an hour, tried to suspend to disk, and the laptop hung. I'm using KDE with effects through XRender; maybe that has an effect.

About suspend to RAM: I don't know how to reproduce this bug. It appears randomly, about once in ten attempts, and that is all I can say about it; the rest is in the kernel log :)

So, any ideas for a fix? Maybe I should provide more information?

It would help to know the specific code that is causing it. gdb 'info line' is your friend; you will need the ttm.ko module or the vmlinux kernel image (if it's built in). A short code snippet to avoid confusion would be nice too (all development happens against a special kernel tree).

Created attachment 34299 [details]
dmesg log
I got a very similar error on 2.6.34, though with a slightly different trace.
Unfortunately I could not reproduce it since I built a debug kernel.
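
As an illustration of the gdb suggestion above: a symbolized frame in an oops has the form symbol+0xOFFSET/0xSIZE [module], and the symbol+offset part is what gdb needs to find the source line. Below is a minimal sketch in C, assuming only that textual frame format. The names oops_frame, parse_oops_frame and format_gdb_query are made up for this example; the resulting command is meant to be typed into gdb with the matching module or vmlinux (built with debug info) loaded.

```c
#include <stdio.h>
#include <string.h>

/* One symbolized frame from an oops, e.g.
 * "ttm_mem_io_reserve+0x17/0x19 [ttm]".
 * Hypothetical helper type, not part of any kernel or gdb API. */
struct oops_frame {
    char symbol[64];
    unsigned long offset;  /* bytes into the function */
    unsigned long size;    /* total size of the function */
    char module[32];       /* empty if no [module] suffix is present */
};

/* Parse "symbol+0xOFF/0xSIZE [module]"; the module part is optional.
 * Returns 0 on success, -1 if the text does not match that shape. */
static int parse_oops_frame(const char *text, struct oops_frame *out)
{
    int n;

    memset(out, 0, sizeof(*out));
    n = sscanf(text, " %63[^+ ]+0x%lx/0x%lx [%31[^]]]",
               out->symbol, &out->offset, &out->size, out->module);
    return n >= 3 ? 0 : -1;
}

/* Build the gdb command that resolves the frame to a source line. */
static int format_gdb_query(const struct oops_frame *f, char *buf, size_t len)
{
    return snprintf(buf, len, "info line *(%s+0x%lx)", f->symbol, f->offset);
}
```

For a frame in the ttm module, the command would be entered after starting gdb on ttm.ko; for built-in code, on vmlinux.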

It seems the ttm code was heavily changed during 2.6.35 development and the function we crashed in (ttm_bo_pci_offset) was removed. We probably need a new trace...

I've tested 2.6.35-rc1 and -rc2 and it looks like it works. At least I haven't hit this error in the last two weeks.

It's not fixed:
PM: resume of devices complete after 5330.459 msecs
[drm] nouveau 0000:02:00.0: PGRAPH_DATA_ERROR - Ch 1/0 Class 0x5039 Mthd 0x0318 Data 0x00001000:0x00084814
[drm] nouveau 0000:02:00.0: PGRAPH_DATA_ERROR - INVALID_VALUE
BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffffa00a4faf>] nouveau_ttm_io_mem_reserve+0x8f/0xba [nouveau]
PGD 1bf29d067 PUD 1bf290067 PMD 0
CPU 3
Modules linked in: nouveau ttm drm_kms_helper snd_hda_codec_realtek snd_hda_intel snd_hda_codec
Pid: 2546, comm: X Not tainted 2.6.35-rc2+ #362 P6T SE/System Product Name
RIP: 0010:[<ffffffffa00a4faf>] [<ffffffffa00a4faf>] nouveau_ttm_io_mem_reserve+0x8f/0xba [nouveau]
RSP: 0018:ffff8801be7cf908 EFLAGS: 00010246
RAX: ffff8801be5eb000 RBX: ffff8801be7cfb58 RCX: ffff8801b7180000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8801b7180118
RBP: ffff8801be7cf918 R08: 800000000000016b R09: 0000000000000001
R10: 00000000012e0001 R11: ffff8800d12ee000 R12: ffff8801b7180118
R13: 0000000000000000 R14: ffff8801be7cfa18 R15: 0000000000000000
FS: 00007f857faf4840(0000) GS:ffff8800026c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 00000001be7a8000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process X (pid: 2546, threadinfo ffff8801be7ce000, task ffff8801bdc69f20)
ffff8801be7cfa20 ffff8801be7cfb58 ffff8801be7cf928 ffffffffa0084761
<0> 0000000000000000 ffff8801b7180118 ffff8801be7cfa58 ffffffffa0084f18
[<ffffffffa0084761>] ttm_mem_io_reserve+0x17/0x19 [ttm]
[<ffffffffa008482f>] ttm_mem_reg_ioremap+0x20/0x81 [ttm]
[<ffffffffa0084f18>] ttm_bo_move_memcpy+0x6e/0x40a [ttm]
[<ffffffffa00a64f8>] ? nouveau_fence_wait+0x4f/0xaf [nouveau]
[<ffffffffa00a630f>] ? nouveau_fence_unref+0x24/0x2f [nouveau]
[<ffffffffa0082caf>] ? ttm_bo_reserve+0x33/0xf3 [ttm]
[<ffffffffa00a5a9a>] nouveau_bo_move+0x183/0x207 [nouveau]
[<ffffffffa0082121>] ttm_bo_handle_move_mem+0x1b4/0x2c0 [ttm]
[<ffffffffa0083fe6>] ttm_bo_move_buffer+0xe6/0x13f [ttm]
[<ffffffffa00840ea>] ttm_bo_validate+0xab/0xf4 [ttm]
[<ffffffffa00a6c3a>] validate_list+0x145/0x28e [nouveau]
[<ffffffffa00a7dde>] nouveau_gem_ioctl_pushbuf+0xf0d/0xf39 [nouveau]
[<ffffffff8129a4e2>] drm_ioctl+0x27b/0x347
[<ffffffffa00a6ed1>] ? nouveau_gem_ioctl_pushbuf+0x0/0xf39 [nouveau]
[<ffffffff8110c001>] vfs_ioctl+0x2d/0xa1
[<ffffffff8110c54d>] do_vfs_ioctl+0x454/0x48d
[<ffffffff810fef91>] ? fget_light+0xa4/0x289
[<ffffffff8110c5c8>] sys_ioctl+0x42/0x65
[<ffffffff8102fd6b>] system_call_fastpath+0x16/0x1b
RIP [<ffffffffa00a4faf>] nouveau_ttm_io_mem_reserve+0x8f/0xba [nouveau]
RSP <ffff8801be7cf908>

Created attachment 36181 [details]
crash in nouveau_ttm_io_mem_reserve (2.6.35-rc2+)
For some reason I can reliably reproduce this bug, so I added some BUG_ONs/WARN_ONs, which revealed that:
- it crashes because mem->mm_node is NULL (at least when mem->mem_type == TTM_PL_VRAM)
- it comes through the 2nd ttm_mem_reg_ioremap in ttm_bo_move_memcpy, so it is new_mem whose mm_node == NULL
- it comes through the 2nd ttm_bo_move_memcpy in nouveau_bo_move, so some earlier nouveau_bo_move_* failed, which can only happen in case of some hardware failure (lockup or something)

Created attachment 36585 [details] [review]
workaround
It "fixes" the oops but doesn't address its cause, a GPU lockup. Any other ideas on how to fix this?

*** Bug 27574 has been marked as a duplicate of this bug. ***

Created attachment 36734 [details] [review]
alternative fix by Francisco Jerez

Fixed by commit 8e530bd62f9eff097caf8e8546fa85f865928b78 "drm/nouveau: Move the fence wait before migration resource clean-up."

*** Bug 29039 has been marked as a duplicate of this bug. ***

*** Bug 27036 has been marked as a duplicate of this bug. ***
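
To make the diagnosis above concrete (mem->mm_node being NULL for a TTM_PL_VRAM region), here is a rough, self-contained sketch in C of what guarding that dereference looks like. It is neither the attached workaround nor the fix from the commit named above, and none of it is real kernel code: the sketch_* types, constants and function are stand-ins invented for this example. Like the workaround, a guard of this kind only avoids the oops and does nothing about the underlying GPU lockup.

```c
#include <errno.h>
#include <stddef.h>

/* Stand-ins for the kernel's memory placement types and PAGE_SHIFT.
 * The values are assumptions chosen for this sketch only. */
#define SKETCH_PL_SYSTEM  0
#define SKETCH_PL_VRAM    2
#define SKETCH_PAGE_SHIFT 12

/* Simplified stand-in for an allocator node: only the start page. */
struct sketch_mm_node {
    unsigned long start;
};

/* Simplified stand-in for a memory region description. */
struct sketch_mem_reg {
    struct sketch_mm_node *mm_node;  /* NULL if no space was allocated */
    unsigned int mem_type;
    unsigned long bus_offset;        /* filled in on success */
};

/* Compute the bus offset of a region, refusing to dereference a
 * missing mm_node instead of faulting on it. */
static int sketch_io_mem_reserve(struct sketch_mem_reg *mem)
{
    switch (mem->mem_type) {
    case SKETCH_PL_SYSTEM:
        return 0;  /* nothing to map for system memory */
    case SKETCH_PL_VRAM:
        if (mem->mm_node == NULL)
            return -EINVAL;  /* the case that oopsed in this report */
        mem->bus_offset = mem->mm_node->start << SKETCH_PAGE_SHIFT;
        return 0;
    default:
        return -EINVAL;
    }
}
```

The caller still has to handle the error, and the earlier failed move that left mm_node unset remains the real problem.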

Created attachment 33225 [details]
kernel log
On 2.6.33-rcX, s2disk sometimes works and sometimes doesn't. When I tested it as described in the kernel documentation, it worked every time; I tried it about 9 times, with different debugging options and without them. But when I used the laptop for half an hour and then tried s2disk, it hung while suspending, after the message "Suspending console" or something like it. One more note about testing: I tested immediately after booting into single-user mode and immediately after starting the X server. The last kernel log, with the stack trace and NULL pointer error, I got after resuming from s2ram.