Created attachment 33225 [details]
On 2.6.33-rcX, s2disk works sometimes and sometimes not. When I tested it as described in the kernel documentation, it worked every time; I tried it about 9 times, with and without various debugging options. But after I had used the laptop for half an hour and then tried s2disk, it hung during suspend after a message like "Suspending console". One more note about the testing: I tested immediately after booting into single-user mode, and immediately after starting the X server.
The last kernel log, with the stack trace and the NULL pointer error, was captured after a resume from s2ram.
Forgot to mention: you can find more information at http://bugzilla.kernel.org/show_bug.cgi?id=15120
Has anybody noticed this report?
Created attachment 33354 [details]
This is the kernel log of the bug, reproduced today on a 2.6.33-rc8 kernel compiled with full debugging.
Regarding the full-debug kernel in my last post: I compiled the kernel with
- Compile the kernel with debug info
- Compile the kernel with frame pointers
from the "Kernel hacking" section in menuconfig.
Before 2.6.33-rcX I used the nvidia blob; it always worked well with suspend to RAM, but it never resumed from suspend to disk. I didn't use nouveau before 2.6.33-rcX, so I can't say whether it worked for me earlier.
Starting from 2.6.33-rcX I'm using the nouveau driver included in the kernel, and I have some trouble with suspending and resuming (both to RAM and to disk).
About suspend to disk: I tested it as described in the kernel documentation immediately after system start, both without X and with it, and with different debug options, and everything worked just fine. Then I worked for an hour, tried to suspend to disk, and the laptop hung. I'm using KDE with effects through XRender; maybe that is a factor.
About suspend to RAM: I don't know how to reproduce this bug; it appears randomly, maybe one time in ten. That is really all I can say about it; the rest is in the kernel log :)
So, any ideas for a fix? Maybe I should provide more information?
It would help to know the specific code that is causing it. gdb's 'info line' is your friend; you will need the ttm.ko module or the vmlinux kernel image (if it's built in). A short code snippet would be nice too, to avoid confusion (all development happens against a special kernel tree).
Created attachment 34299 [details]
I got a very similar error on 2.6.34, though with a slightly different trace.
Unfortunately, I have not been able to reproduce it since I built a debug kernel.
It seems the ttm code was heavily changed during 2.6.35 development, and the function we crash in (ttm_bo_pci_offset) was removed.
We probably need a new trace...
I've tested 2.6.35-rc1 and -rc2, and it looks like it works; at least I haven't hit this error in the last two weeks.
PM: resume of devices complete after 5330.459 msecs
[drm] nouveau 0000:02:00.0: PGRAPH_DATA_ERROR - Ch 1/0 Class 0x5039 Mthd 0x0318 Data 0x00001000:0x00084814
[drm] nouveau 0000:02:00.0: PGRAPH_DATA_ERROR - INVALID_VALUE
BUG: unable to handle kernel NULL pointer dereference at 0000000000000028
IP: [<ffffffffa00a4faf>] nouveau_ttm_io_mem_reserve+0x8f/0xba [nouveau]
PGD 1bf29d067 PUD 1bf290067 PMD 0
Modules linked in: nouveau ttm drm_kms_helper snd_hda_codec_realtek snd_hda_intel snd_hda_codec
Pid: 2546, comm: X Not tainted 2.6.35-rc2+ #362 P6T SE/System Product Name
RIP: 0010:[<ffffffffa00a4faf>] [<ffffffffa00a4faf>] nouveau_ttm_io_mem_reserve+0x8f/0xba [nouveau]
RSP: 0018:ffff8801be7cf908 EFLAGS: 00010246
RAX: ffff8801be5eb000 RBX: ffff8801be7cfb58 RCX: ffff8801b7180000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8801b7180118
RBP: ffff8801be7cf918 R08: 800000000000016b R09: 0000000000000001
R10: 00000000012e0001 R11: ffff8800d12ee000 R12: ffff8801b7180118
R13: 0000000000000000 R14: ffff8801be7cfa18 R15: 0000000000000000
FS: 00007f857faf4840(0000) GS:ffff8800026c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000028 CR3: 00000001be7a8000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process X (pid: 2546, threadinfo ffff8801be7ce000, task ffff8801bdc69f20)
ffff8801be7cfa20 ffff8801be7cfb58 ffff8801be7cf928 ffffffffa0084761
<0> 0000000000000000 ffff8801b7180118 ffff8801be7cfa58 ffffffffa0084f18
[<ffffffffa0084761>] ttm_mem_io_reserve+0x17/0x19 [ttm]
[<ffffffffa008482f>] ttm_mem_reg_ioremap+0x20/0x81 [ttm]
[<ffffffffa0084f18>] ttm_bo_move_memcpy+0x6e/0x40a [ttm]
[<ffffffffa00a64f8>] ? nouveau_fence_wait+0x4f/0xaf [nouveau]
[<ffffffffa00a630f>] ? nouveau_fence_unref+0x24/0x2f [nouveau]
[<ffffffffa0082caf>] ? ttm_bo_reserve+0x33/0xf3 [ttm]
[<ffffffffa00a5a9a>] nouveau_bo_move+0x183/0x207 [nouveau]
[<ffffffffa0082121>] ttm_bo_handle_move_mem+0x1b4/0x2c0 [ttm]
[<ffffffffa0083fe6>] ttm_bo_move_buffer+0xe6/0x13f [ttm]
[<ffffffffa00840ea>] ttm_bo_validate+0xab/0xf4 [ttm]
[<ffffffffa00a6c3a>] validate_list+0x145/0x28e [nouveau]
[<ffffffffa00a7dde>] nouveau_gem_ioctl_pushbuf+0xf0d/0xf39 [nouveau]
[<ffffffffa00a6ed1>] ? nouveau_gem_ioctl_pushbuf+0x0/0xf39 [nouveau]
[<ffffffff810fef91>] ? fget_light+0xa4/0x289
RIP [<ffffffffa00a4faf>] nouveau_ttm_io_mem_reserve+0x8f/0xba [nouveau]
Created attachment 36181 [details]
crash in nouveau_ttm_io_mem_reserve (2.6.35-rc2+)
For some reason I can reliably reproduce this bug, so I've added some BUG_ONs/WARN_ONs, and that revealed the following:
- it crashes because mem->mm_node is NULL (at least when mem->mem_type == TTM_PL_VRAM)
- it comes through the 2nd ttm_mem_reg_ioremap call in ttm_bo_move_memcpy, so it is new_mem whose mm_node is NULL
- it comes through the 2nd ttm_bo_move_memcpy call in nouveau_bo_move, so some earlier nouveau_bo_move_* call failed, which can only happen in case of some hardware failure (a lockup or something)
Created attachment 36585 [details] [review]
It "fixes" the oops, but it doesn't address its cause: a GPU lockup.
Any other ideas on how to fix this?
*** Bug 27574 has been marked as a duplicate of this bug. ***
Created attachment 36734 [details] [review]
alternative fix by Francisco Jerez
Fixed by commit 8e530bd62f9eff097caf8e8546fa85f865928b78 ("drm/nouveau: Move the fence wait before migration resource clean-up").
*** Bug 29039 has been marked as a duplicate of this bug. ***
*** Bug 27036 has been marked as a duplicate of this bug. ***