Bug 64774 - nouveau GF108 kernel crash in optimus mode when enabling external display output
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: Other All
Importance: medium major
Assignee: Nouveau Project
QA Contact: Xorg Project Team
Reported: 2013-05-19 20:54 UTC by Pasi Kärkkäinen
Modified: 2014-01-06 16:17 UTC
Attachments
dmesg Linux 3.8.11 nouveau optimus before the kernel crash (71.82 KB, text/plain)
2013-05-19 20:54 UTC, Pasi Kärkkäinen
Screenshot 01 of the nouveau kernel crash with traceback (661.01 KB, image/jpeg)
2013-05-19 20:55 UTC, Pasi Kärkkäinen
Screenshot 02 of the nouveau kernel crash with traceback (620.41 KB, image/jpeg)
2013-05-19 20:56 UTC, Pasi Kärkkäinen
linux-3.9.9-gpu-drm-nouveau-warn-on-node-pages.patch (514 bytes, patch)
2013-07-18 20:37 UTC, Pasi Kärkkäinen
dmesg-3.9.9-201.dbg01-with-custom_warn-on-pages_patch-optimus_enabled-nouveau_does_not_work.txt (73.01 KB, text/plain)
2013-07-18 20:46 UTC, Pasi Kärkkäinen
dmesg-3.9.9-201.dbg01.fc18.x86_64-optimus_enabled-warn-traceback-after-trying-to-enable-dvi-monitor (76.68 KB, text/plain)
2013-07-19 14:24 UTC, Pasi Kärkkäinen

Description Pasi Kärkkäinen 2013-05-19 20:54:37 UTC
Created attachment 79549 [details]
dmesg Linux 3.8.11 nouveau optimus before the kernel crash

I'm using a Lenovo T430 laptop with Intel+NVIDIA hybrid graphics; Optimus is enabled in the BIOS:

$ lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: NVIDIA Corporation GF108 [Quadro NVS 5400M] (rev a1)

$ uname -a
Linux localhost.localdomain 3.8.11-200.fc18.x86_64 #1 SMP Wed May 1 19:44:27 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

I boot the system with only the laptop's internal LVDS panel active on the Intel GPU. Once in the X session, I enable a second display on the nouveau GPU, connected to the dock's DP2/DVI connector.
At this point the kernel crashes immediately.

Screenshots of the kernel crash/stacktrace here:
http://pasik.reaktio.net/nouveau/debug/nouveau-kernel-crash01.jpg
http://pasik.reaktio.net/nouveau/debug/nouveau-kernel-crash02.jpg

Hand-transcribed excerpts from the kernel crash:

Pid: 1208, comm: Xorg Not tainted 3.8.11-200.fc18.x86_64 #1 LENOVO 2349H2G/2349H2G
RIP: 0010:[<ffffffffa01abb7e>] [<ffffffffa01abb7e>] nvc0_vm_map_sg+0x8e/0x110 [nouveau]
RSP: 0018:ffff8803298357c8 EFLAGS: 00010206
..
Process Xorg (pid: 1208,..)
..
Call Trace:
.. nouveau_vm_map_sg+0xc2/0x130 [nouveau]
.. nouveau_vma_getmap.isra.11+0x68/0xa0 [nouveau]
.. nouveau_bo_move_m2mf.isra.12+0x85/0x140 [nouveau]
.. ? nouveau_vm_map_at+0x153/0x1c0 [nouveau]
.. nouveau_bo_move+0xa5/0x400 [nouveau]
.. ttm_bo_handle_move_mem+0x245/0x610 [ttm]

More info is in the screenshots above (also attached).

Kernel dmesg from before the crash is attached.
I also tried Linux kernel 3.9.2, and the system crashes hard there as well.
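
The `nvc0_vm_map_sg+0x8e/0x110` notation in the RIP line above is symbol, offset into the function, and function length, with both numbers in hex. A minimal sketch for decoding it so the offset can be matched against a gdb disassembly, which prints offsets in decimal (the helper name `decode_frame` is made up for illustration):

```python
import re

# Matches the "symbol+0xOFFSET/0xLENGTH [module]" form used in oops output.
FRAME_RE = re.compile(
    r"(?P<sym>[\w.]+)\+0x(?P<off>[0-9a-f]+)/0x(?P<len>[0-9a-f]+)"
    r"(?: \[(?P<mod>\w+)\])?"
)

def decode_frame(text):
    """Return (symbol, offset, length, module); offset and length are ints."""
    m = FRAME_RE.search(text)
    if m is None:
        raise ValueError("no symbol+offset found")
    return (m.group("sym"), int(m.group("off"), 16),
            int(m.group("len"), 16), m.group("mod"))

# RIP line copied from the crash excerpt in this report.
rip = ("RIP: 0010:[<ffffffffa01abb7e>] [<ffffffffa01abb7e>] "
       "nvc0_vm_map_sg+0x8e/0x110 [nouveau]")
sym, off, length, mod = decode_frame(rip)
print(f"{sym} in {mod}: fault at +{off} of {length} bytes")
# prints "nvc0_vm_map_sg in nouveau: fault at +142 of 272 bytes"
```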
Comment 1 Pasi Kärkkäinen 2013-05-19 20:55:18 UTC
Created attachment 79550 [details]
Screenshot 01 of the nouveau kernel crash with traceback
Comment 2 Pasi Kärkkäinen 2013-05-19 20:56:08 UTC
Created attachment 79551 [details]
Screenshot 02 of the nouveau kernel crash with traceback
Comment 3 Ilia Mirkin 2013-05-19 21:03:49 UTC
Would be nice to see the output of

gdb /lib/modules/..../nouveau.ko

disassemble nvc0_vm_map_sg

Looks like there's a null deref in there (CR2 = 0). Although I think that optimus mode isn't supposed to be well-supported right now. But we still shouldn't be oopsing the kernel.
Comment 4 Pasi Kärkkäinen 2013-05-19 21:10:30 UTC
Ok, here goes:

Reading symbols from /usr/lib/modules/3.8.11-200.fc18.x86_64/kernel/drivers/gpu/drm/nouveau/nouveau.ko...(no debugging symbols found)...done.
Missing separate debuginfos, use: debuginfo-install kernel-3.8.11-200.fc18.x86_64

(gdb) disassemble nvc0_vm_map_sg
Dump of assembler code for function nvc0_vm_map_sg:
   0x0000000000023b20 <+0>:	callq  0x23b25 <nvc0_vm_map_sg+5>
   0x0000000000023b25 <+5>:	push   %rbp
   0x0000000000023b26 <+6>:	mov    %rsp,%rbp
   0x0000000000023b29 <+9>:	push   %r15
   0x0000000000023b2b <+11>:	push   %r14
   0x0000000000023b2d <+13>:	push   %r13
   0x0000000000023b2f <+15>:	mov    %rsi,%r13
   0x0000000000023b32 <+18>:	push   %r12
   0x0000000000023b34 <+20>:	push   %rbx
   0x0000000000023b35 <+21>:	lea    0x0(,%rcx,8),%ebx
   0x0000000000023b3c <+28>:	sub    $0x38,%rsp
   0x0000000000023b40 <+32>:	mov    0x30(%rdi),%esi
   0x0000000000023b43 <+35>:	mov    %rdx,-0x40(%rbp)
   0x0000000000023b47 <+39>:	mov    %rdi,-0x38(%rbp)
   0x0000000000023b4b <+43>:	mov    %r9,-0x48(%rbp)
   0x0000000000023b4f <+47>:	lea    -0x1(%r8),%edx
   0x0000000000023b53 <+51>:	mov    %esi,%eax
   0x0000000000023b55 <+53>:	and    $0x10,%eax
   0x0000000000023b58 <+56>:	cmp    $0x1,%eax
   0x0000000000023b5b <+59>:	sbb    %eax,%eax
   0x0000000000023b5d <+61>:	and    $0xfffffffe,%eax
   0x0000000000023b60 <+64>:	add    $0x7,%eax
Comment 5 Pasi Kärkkäinen 2013-05-19 21:12:56 UTC
here's the whole function:

(gdb) disassemble nvc0_vm_map_sg
Dump of assembler code for function nvc0_vm_map_sg:
   0x0000000000023b20 <+0>:	callq  0x23b25 <nvc0_vm_map_sg+5>
   0x0000000000023b25 <+5>:	push   %rbp
   0x0000000000023b26 <+6>:	mov    %rsp,%rbp
   0x0000000000023b29 <+9>:	push   %r15
   0x0000000000023b2b <+11>:	push   %r14
   0x0000000000023b2d <+13>:	push   %r13
   0x0000000000023b2f <+15>:	mov    %rsi,%r13
   0x0000000000023b32 <+18>:	push   %r12
   0x0000000000023b34 <+20>:	push   %rbx
   0x0000000000023b35 <+21>:	lea    0x0(,%rcx,8),%ebx
   0x0000000000023b3c <+28>:	sub    $0x38,%rsp
   0x0000000000023b40 <+32>:	mov    0x30(%rdi),%esi
   0x0000000000023b43 <+35>:	mov    %rdx,-0x40(%rbp)
   0x0000000000023b47 <+39>:	mov    %rdi,-0x38(%rbp)
   0x0000000000023b4b <+43>:	mov    %r9,-0x48(%rbp)
   0x0000000000023b4f <+47>:	lea    -0x1(%r8),%edx
   0x0000000000023b53 <+51>:	mov    %esi,%eax
   0x0000000000023b55 <+53>:	and    $0x10,%eax
   0x0000000000023b58 <+56>:	cmp    $0x1,%eax
   0x0000000000023b5b <+59>:	sbb    %eax,%eax
   0x0000000000023b5d <+61>:	and    $0xfffffffe,%eax
   0x0000000000023b60 <+64>:	add    $0x7,%eax
   0x0000000000023b63 <+67>:	test   %r8d,%r8d
   0x0000000000023b66 <+70>:	je     0x23c16 <nvc0_vm_map_sg+246>
   0x0000000000023b6c <+76>:	mov    %rax,%rcx
   0x0000000000023b6f <+79>:	lea    0x4(%rbx),%eax
   0x0000000000023b72 <+82>:	lea    0x8(,%rdx,8),%rdx
   0x0000000000023b7a <+90>:	xor    %r15d,%r15d
   0x0000000000023b7d <+93>:	shl    $0x20,%rcx
   0x0000000000023b81 <+97>:	mov    %eax,-0x5c(%rbp)
   0x0000000000023b84 <+100>:	mov    %r13,%rax
   0x0000000000023b87 <+103>:	mov    %rcx,-0x50(%rbp)
   0x0000000000023b8b <+107>:	mov    %r15,%r13
   0x0000000000023b8e <+110>:	mov    %rdx,-0x58(%rbp)
   0x0000000000023b92 <+114>:	mov    %rax,%r15
   0x0000000000023b95 <+117>:	jmp    0x23ba7 <nvc0_vm_map_sg+135>
   0x0000000000023b97 <+119>:	nopw   0x0(%rax,%rax,1)
   0x0000000000023ba0 <+128>:	mov    -0x38(%rbp),%rdx
   0x0000000000023ba4 <+132>:	mov    0x30(%rdx),%esi
   0x0000000000023ba7 <+135>:	mov    -0x48(%rbp),%rcx
   0x0000000000023bab <+139>:	mov    %r15,%rdi
   0x0000000000023bae <+142>:	mov    (%rcx,%r13,1),%rax
   0x0000000000023bb2 <+146>:	shr    $0x8,%rax
   0x0000000000023bb6 <+150>:	mov    %rax,%rdx
   0x0000000000023bb9 <+153>:	or     $0x3,%rax
   0x0000000000023bbd <+157>:	or     $0x1,%rdx
   0x0000000000023bc1 <+161>:	and    $0x4,%esi
   0x0000000000023bc4 <+164>:	lea    0x0(%r13,%rbx,1),%esi
   0x0000000000023bc9 <+169>:	cmovne %rax,%rdx
   0x0000000000023bcd <+173>:	mov    -0x40(%rbp),%rax
   0x0000000000023bd1 <+177>:	mov    0xd8(%rax),%r14d
   0x0000000000023bd8 <+184>:	shl    $0x24,%r14
   0x0000000000023bdc <+188>:	or     -0x50(%rbp),%r14
   0x0000000000023be0 <+192>:	or     %rdx,%r14
   0x0000000000023be3 <+195>:	mov    (%r15),%rdx
   0x0000000000023be6 <+198>:	mov    0x8(%rdx),%r10
   0x0000000000023bea <+202>:	mov    %r14d,%edx
   0x0000000000023bed <+205>:	callq  *0x48(%r10)
   0x0000000000023bf1 <+209>:	mov    (%r15),%rdx
   0x0000000000023bf4 <+212>:	mov    -0x5c(%rbp),%esi
   0x0000000000023bf7 <+215>:	mov    %r15,%rdi
   0x0000000000023bfa <+218>:	mov    0x8(%rdx),%r10
   0x0000000000023bfe <+222>:	mov    %r14,%rdx
   0x0000000000023c01 <+225>:	add    %r13d,%esi
   0x0000000000023c04 <+228>:	shr    $0x20,%rdx
   0x0000000000023c08 <+232>:	add    $0x8,%r13
   0x0000000000023c0c <+236>:	callq  *0x48(%r10)
   0x0000000000023c10 <+240>:	cmp    -0x58(%rbp),%r13
   0x0000000000023c14 <+244>:	jne    0x23ba0 <nvc0_vm_map_sg+128>
   0x0000000000023c16 <+246>:	add    $0x38,%rsp
   0x0000000000023c1a <+250>:	pop    %rbx
   0x0000000000023c1b <+251>:	pop    %r12
   0x0000000000023c1d <+253>:	pop    %r13
   0x0000000000023c1f <+255>:	pop    %r14
   0x0000000000023c21 <+257>:	pop    %r15
   0x0000000000023c23 <+259>:	pop    %rbp
   0x0000000000023c24 <+260>:	retq   
End of assembler dump.
(gdb)
Comment 6 Ilia Mirkin 2013-05-19 21:42:34 UTC
Well, +0x8e is +142, and we see

   0x0000000000023bae <+142>:	mov    (%rcx,%r13,1),%rax
   0x0000000000023bb2 <+146>:	shr    $0x8,%rax
   0x0000000000023bb6 <+150>:	mov    %rax,%rdx
   0x0000000000023bb9 <+153>:	or     $0x3,%rax
   0x0000000000023bbd <+157>:	or     $0x1,%rdx

which I'm fairly sure corresponds to

		u64 phys = nvc0_vm_addr(vma, *list++, memtype, target);

Since

static inline u64
nvc0_vm_addr(struct nouveau_vma *vma, u64 phys, u32 memtype, u32 target)
{
	phys >>= 8;
	phys |= 0x00000001; /* present */
	if (vma->access & NV_MEM_ACCESS_SYS)
		phys |= 0x00000002;

(And for some reason it splits the two branches into two separate registers... odd, but nothing else in the code matches up as nicely.)
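
To make the matching concrete, here is a rough Python model of what the loop body appears to compute per page. The first half follows the quoted source; the two shifts are only read off the disassembly (`shl $0x20` for the target, `shl $0x24` for the memtype), so treat them as an inference rather than the actual source. The constant name `ACCESS_SYS` and its value 0x4 are likewise assumed from `and $0x4,%esi`.

```python
MASK64 = (1 << 64) - 1

# Assumption: value of NV_MEM_ACCESS_SYS, inferred from `and $0x4,%esi`.
ACCESS_SYS = 0x4

def vm_addr(access, phys, memtype, target):
    """Model of the per-page value built in nvc0_vm_map_sg (inferred)."""
    pte = (phys >> 8) & MASK64
    pte |= 0x1                  # present
    if access & ACCESS_SYS:
        pte |= 0x2              # the `or $0x3` variant picked by cmovne
    pte |= target << 32         # inferred from `shl $0x20,%rcx`
    pte |= memtype << 36        # inferred from `shl $0x24,%r14`
    return pte & MASK64

# Hypothetical page address, memtype 0, target 5.
print(hex(vm_addr(ACCESS_SYS, 0x12345000, 0, 5)))  # prints 0x500123453
```

The faulting `mov (%rcx,%r13,1),%rax` at +142 is the read of the page address itself, before any of these bit operations run, with `%r13` still zero on the first pass through the loop.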

So that means that the passed in list pointer must be null. This corresponds to

drivers/gpu/drm/nouveau/nouveau_bo.c:nouveau_vma_getmap which passes in mem->mm_node as the mem argument to vm_map_sg, which in turn does mem->pages.

So perhaps add something to the top of nouveau_vma_getmap (before the vm_get call) like

if (WARN_ON(!node->pages)) {
  return -EINVAL;
}

Which should help avoid the crash, but will not provide any additional functionality. You should then see a backtrace, but no crash.
Comment 7 Ilia Mirkin 2013-05-19 21:44:30 UTC
Actually, that should probably be WARN_ON(mem->mem_type != TTM_PL_VRAM && !node->pages) otherwise you'll get a lot of spurious warns/failures.
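
A small Python model of the two variants of that check, just to show the difference in behaviour. It assumes (not verified here) that VRAM-backed nodes legitimately carry no page list, which would be why the unconditional check fires spuriously. The function names are made up; the placement ids mirror the kernel's TTM constants as far as I know.

```python
import errno

# Assumption: placement ids as defined in the kernel's TTM headers.
TTM_PL_SYSTEM, TTM_PL_TT, TTM_PL_VRAM = 0, 1, 2

def guard_simple(mem_type, pages):
    """First suggestion: reject any node without a page list."""
    return -errno.EINVAL if pages is None else 0

def guard_refined(mem_type, pages):
    """Refined suggestion: only non-VRAM nodes need a page list."""
    if mem_type != TTM_PL_VRAM and pages is None:
        return -errno.EINVAL
    return 0

# A VRAM node without pages trips the simple check but not the refined one.
print(guard_simple(TTM_PL_VRAM, None), guard_refined(TTM_PL_VRAM, None))
# A GART (TT) node without pages is the case that oopses; both reject it.
print(guard_simple(TTM_PL_TT, None), guard_refined(TTM_PL_TT, None))
```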
Comment 8 Pasi Kärkkäinen 2013-07-18 20:36:02 UTC
I tested the most recent Fedora 18 kernel update (3.9.9-201.fc18.x86_64), and the bug is still there. I enabled Optimus in the BIOS, booted to Linux, tried to enable nouveau DVI output, and the kernel crashed immediately.

Then I built a custom kernel based on 3.9.9-201.fc18.x86_64 with the attached patch included (as suggested above), and booted with that. Now there's a different problem: the nouveau outputs are not detected and cannot be enabled, so I can't tell whether the patch fixes or works around the crash. At least it breaks something :)

There are no obvious nouveau-related errors in the dmesg, which is attached as well.
Comment 9 Pasi Kärkkäinen 2013-07-18 20:37:10 UTC
Created attachment 82633 [details] [review]
linux-3.9.9-gpu-drm-nouveau-warn-on-node-pages.patch
Comment 10 Pasi Kärkkäinen 2013-07-18 20:46:17 UTC
Created attachment 82635 [details]
dmesg-3.9.9-201.dbg01-with-custom_warn-on-pages_patch-optimus_enabled-nouveau_does_not_work.txt
Comment 11 Ilia Mirkin 2013-07-18 20:53:57 UTC
That last dmesg looks fine... what's the problem? (It seems unlikely that merely adding a WARN_ON that never fires would fix things, points to a compiler issue? Or a different config/compilation mechanism/??)

You now have both an Intel and an NVIDIA device. I think it starts out on the Intel device; you can use the regular vga_switcheroo mechanisms to switch to the NVIDIA device. Some docs about it are available at https://help.ubuntu.com/community/HybridGraphics. If you're aware of all that, can you be more specific as to what you're doing and what exactly doesn't work?
Comment 12 Pasi Kärkkäinen 2013-07-18 21:04:49 UTC
Ok, sorry for not explaining properly. 

With the stock fedora 3.9.9-201 kernel the nouveau dvi outputs are detected OK, so in the Mate desktop display properties application I can see the external monitor that is connected with DVI, and I can choose to enable it, which causes the kernel crash. 

Now with the custom 3.9.9-201.dbg01 kernel that I built, which only has the warn_on patch added on top of the stock kernel, the nouveau outputs are *not* detected at all, so I can't see the DVI monitor on the display properties application, and I can't enable the DVI monitor.. 

I tried rebooting the laptop multiple times, and the behaviour stays the same. 

Maybe I should play with xrandr as well, and probably build a new kernel with that patch removed to see if it really causes the issue.

Thanks for your reply!
Comment 13 Ilia Mirkin 2013-07-18 21:15:08 UTC
Outputs certainly *appear* to be detected. What makes you say they're not?

[    2.882654] nouveau  [     DRM] DCB outp 00: 01800323 00010034
[    2.882656] nouveau  [     DRM] DCB outp 01: 02811300 00000000
[    2.882657] nouveau  [     DRM] DCB outp 02: 028223a6 0f220010
[    2.882658] nouveau  [     DRM] DCB outp 03: 02822362 00020010
[    2.882659] nouveau  [     DRM] DCB outp 04: 048333b6 0f220010
[    2.882660] nouveau  [     DRM] DCB outp 05: 04833372 00020010
[    2.882662] nouveau  [     DRM] DCB outp 06: 088443c6 0f220010
[    2.882663] nouveau  [     DRM] DCB outp 07: 08844382 00020010
[    2.882664] nouveau  [     DRM] DCB conn 00: 00000040
[    2.882666] nouveau  [     DRM] DCB conn 01: 00000100
[    2.882667] nouveau  [     DRM] DCB conn 02: 00110246
[    2.882668] nouveau  [     DRM] DCB conn 03: 00220346
[    2.882669] nouveau  [     DRM] DCB conn 04: 01400446
[    3.023865] nouveau  [     DRM] allocated 1920x1080 fb: 0x60000, bo ffff8803285f9c00

I assume you have a 1920x1080 screen hooked up to the DVI?

You can look at the KMS status in /sys/class/drm/ -- you should see cardN-DVI-I-N  directories. xrandr should also provide accurate status, e.g.:

$ grep . /sys/class/drm/card*-*/status 
/sys/class/drm/card0-DVI-I-1/status:connected
/sys/class/drm/card0-DVI-I-2/status:disconnected

However it may end up being hidden due to vgaswitcheroo -- so make sure to look into that too. (Unfortunately I know next to nothing about vgaswitcheroo, but I'm sure Google knows more.)
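
If it's easier to script than to grep, roughly the same check in Python. This is only a sketch: it assumes the `cardN-CONNECTOR/status` layout shown above, and the function name `connector_status` is made up.

```python
from pathlib import Path

def connector_status(drm_root="/sys/class/drm"):
    """Map each KMS connector directory name to the content of its status file."""
    result = {}
    # Same pattern as the grep above: card*-*/status
    for status_file in sorted(Path(drm_root).glob("card*-*/status")):
        result[status_file.parent.name] = status_file.read_text().strip()
    return result

if __name__ == "__main__":
    for name, status in connector_status().items():
        print(f"{name}: {status}")
```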
Comment 14 Pasi Kärkkäinen 2013-07-19 12:52:12 UTC
Ok, after a lot of reboots and trying different kernels, I realized the issue is not related to the warn_on patch I added. The difference comes from whether the DVI monitor is already connected when I power on the system.

So what happens is:

1) I boot to Linux with DVI monitor connected to nouveau before powering on the system. This is how it looks then:

$ grep . /sys/class/drm/card*-*/status
/sys/class/drm/card0-LVDS-1/status:connected
/sys/class/drm/card0-VGA-1/status:disconnected
/sys/class/drm/card1-DP-1/status:disconnected
/sys/class/drm/card1-DP-2/status:connected
/sys/class/drm/card1-DP-3/status:disconnected
/sys/class/drm/card1-LVDS-2/status:disconnected
/sys/class/drm/card1-VGA-2/status:disconnected

DP-2 is the external DVI monitor. So it's listed as "connected", but xrandr doesn't see it:

$ xrandr
Screen 0: minimum 320 x 200, current 1600 x 900, maximum 8192 x 8192
LVDS1 connected 1600x900+0+0 (normal left inverted right x axis y axis) 309mm x 174mm
   1600x900       60.0*+   40.0  
   1024x768       60.0  
   800x600        60.3     56.2  
   640x480        59.9  
VGA1 disconnected (normal left inverted right x axis y axis)


Nothing about the nouveau adapter in xrandr output.. 


2) I boot to Linux with DVI monitor NOT connected to nouveau. I connect the DVI cable after the system has booted to Xorg. This is how it looks then:

$ grep . /sys/class/drm/card*-*/status
/sys/class/drm/card0-LVDS-1/status:connected
/sys/class/drm/card0-VGA-1/status:disconnected
/sys/class/drm/card1-DP-1/status:disconnected
/sys/class/drm/card1-DP-2/status:connected
/sys/class/drm/card1-DP-3/status:disconnected
/sys/class/drm/card1-LVDS-2/status:disconnected
/sys/class/drm/card1-VGA-2/status:disconnected

So again DP-2 is shown as connected to the DVI monitor, which is true. 

$ xrandr
Screen 0: minimum 320 x 200, current 1600 x 900, maximum 8192 x 8192
LVDS1 connected 1600x900+0+0 (normal left inverted right x axis y axis) 309mm x 174mm
   1600x900       60.0*+   40.0  
   1024x768       60.0  
   800x600        60.3     56.2  
   640x480        59.9  
VGA1 disconnected (normal left inverted right x axis y axis)
LVDS-2 disconnected (normal left inverted right x axis y axis)
VGA-2 disconnected (normal left inverted right x axis y axis)
DP-1 disconnected (normal left inverted right x axis y axis)
DP-2 connected (normal left inverted right x axis y axis)
   1920x1080      59.9 +
   1600x1200      60.0  
   1680x1050      59.9  
   1280x1024      60.0  
   1440x900       59.9  
   1280x960       60.0  
   1280x800       59.9  
   1024x768       60.0  
   800x600        60.3     56.2  
   640x480        60.0  
DP-3 disconnected (normal left inverted right x axis y axis)
  1024x768 (0x8e)   65.0MHz
        h: width  1024 start 1048 end 1184 total 1344 skew    0 clock   48.4KHz
        v: height  768 start  771 end  777 total  806           clock   60.0Hz
  800x600 (0x8f)   40.0MHz
        h: width   800 start  840 end  968 total 1056 skew    0 clock   37.9KHz
        v: height  600 start  601 end  605 total  628           clock   60.3Hz
  800x600 (0x90)   36.0MHz
        h: width   800 start  824 end  896 total 1024 skew    0 clock   35.2KHz
        v: height  600 start  601 end  603 total  625           clock   56.2Hz

And now xrandr can see the DVI monitor as well. So when I plug the DVI cable into the nouveau adapter on the fly, the external display is actually detected, and can be enabled and used (with artifacts/corruption, but that is a separate bug ;)

So 'xrandr' shows the same problem as I described earlier with the Mate desktop display-properties GUI application. When the laptop is booted with DVI cable plugged in from the start, the mate-display-properties GUI doesn't allow me to enable the DVI monitor - so the behaviour matches xrandr. I assume mate-display-properties internally uses xrandr. 

But this really sounds like another bug.
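
To compare the two cases mechanically, here is a rough sketch that pulls the output names and states out of `xrandr` text. It relies only on the `NAME connected|disconnected ...` line format visible in the listings above; the function name `xrandr_outputs` is made up.

```python
import re

# Output lines start in column 0; mode lines are indented and never match.
OUTPUT_RE = re.compile(r"^(?P<name>\S+) (?P<state>connected|disconnected)\b")

def xrandr_outputs(text):
    """Return {output name: 'connected' | 'disconnected'} from xrandr text."""
    outputs = {}
    for line in text.splitlines():
        m = OUTPUT_RE.match(line)
        if m:
            outputs[m.group("name")] = m.group("state")
    return outputs

# Case 1 from this comment (booted with the DVI monitor already connected).
case1 = """Screen 0: minimum 320 x 200, current 1600 x 900, maximum 8192 x 8192
LVDS1 connected 1600x900+0+0 (normal left inverted right x axis y axis) 309mm x 174mm
   1600x900       60.0*+   40.0
   1024x768       60.0
VGA1 disconnected (normal left inverted right x axis y axis)
"""
outputs = xrandr_outputs(case1)
print(outputs)
print("nouveau outputs visible:", any(n.startswith("DP-") for n in outputs))
```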
Comment 15 Pasi Kärkkäinen 2013-07-19 14:23:31 UTC
xrandr failing when booted with the DVI cable connected doesn't seem to happen *always*; it seems to be a bit random. About one boot out of ten, I'm able to see the nouveau connectors in xrandr, so I can try to enable them and thus hit the warn_on in the patch I added to the kernel. So here goes:

[  168.514913] ------------[ cut here ]------------
[  168.514954] WARNING: at drivers/gpu/drm/nouveau/nouveau_bo.c:952 nouveau_vma_getmap.isra.11+0xc1/0xe0 [nouveau]()
[  168.514956] Hardware name: 2349H2G
[  168.514957] Modules linked in: fuse ebtable_nat ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack be2iscsi iscsi_boot_sysfs bnx2i ebtable_filter cnic ebtables uio ip6table_filter cxgb4i cxgb4 ip6_tables cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rfcomm bnep snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support arc4 iwldvm acpi_cpufreq mac80211 mperf coretemp uvcvideo microcode snd_hda_intel videobuf2_vmalloc snd_hda_codec i2c_i801 videobuf2_memops snd_hwdep videobuf2_core snd_seq iwlwifi snd_seq_device videodev cdc_mbim btusb snd_pcm
[  168.514987]  media bluetooth cfg80211 cdc_ncm usbnet cdc_wdm cdc_acm mii lpc_ich e1000e mfd_core snd_page_alloc snd_timer mei ptp thinkpad_acpi pps_core snd soundcore rfkill vhost_net tun macvtap macvlan kvm_intel kvm uinput dm_crypt crc32_pclmul crc32c_intel nouveau i915 ghash_clmulni_intel mxm_wmi sdhci_pci i2c_algo_bit sdhci ttm drm_kms_helper mmc_core drm i2c_core wmi video
[  168.515008] Pid: 914, comm: Xorg Not tainted 3.9.9-201.dbg01.fc18.x86_64 #1
[  168.515009] Call Trace:
[  168.515017]  [<ffffffff8105efc5>] warn_slowpath_common+0x75/0xa0
[  168.515019]  [<ffffffff8105f00a>] warn_slowpath_null+0x1a/0x20
[  168.515036]  [<ffffffffa020d801>] nouveau_vma_getmap.isra.11+0xc1/0xe0 [nouveau]
[  168.515046]  [<ffffffffa019292a>] ? _nouveau_gpuobj_wr32+0x2a/0x30 [nouveau]
[  168.515062]  [<ffffffffa020d8a6>] nouveau_bo_move_m2mf.isra.12+0x86/0x140 [nouveau]
[  168.515077]  [<ffffffffa01b6b23>] ? nouveau_vm_map_at+0x153/0x1c0 [nouveau]
[  168.515092]  [<ffffffffa020e2aa>] nouveau_bo_move+0x9a/0x400 [nouveau]
[  168.515100]  [<ffffffffa007ee15>] ttm_bo_handle_move_mem+0x245/0x610 [ttm]
[  168.515106]  [<ffffffffa007fd00>] ? ttm_bo_mem_space+0x180/0x360 [ttm]
[  168.515112]  [<ffffffffa007fff7>] ttm_bo_move_buffer+0x117/0x130 [ttm]
[  168.515116]  [<ffffffff8122b1fa>] ? ext4_dirty_inode+0x5a/0x70
[  168.515125]  [<ffffffffa00800aa>] ttm_bo_validate+0x9a/0x110 [ttm]
[  168.515140]  [<ffffffffa020eb1c>] nouveau_bo_validate+0x1c/0x20 [nouveau]
[  168.515155]  [<ffffffffa020ed4b>] nouveau_bo_pin+0x9b/0x100 [nouveau]
[  168.515158]  [<ffffffff8130c6b4>] ? snprintf+0x34/0x40
[  168.515173]  [<ffffffffa0231b25>] nv50_crtc_mode_set_base+0x55/0xf0 [nouveau]
[  168.515180]  [<ffffffffa007320b>] drm_crtc_helper_set_config+0x77b/0xb30 [drm_kms_helper]
[  168.515194]  [<ffffffffa003c76e>] drm_mode_set_config_internal+0x2e/0x60 [drm]
[  168.515204]  [<ffffffffa003eecc>] drm_mode_setcrtc+0x10c/0x570 [drm]
[  168.515208]  [<ffffffff8165f88d>] ? mutex_lock+0x1d/0x50
[  168.515217]  [<ffffffffa002f483>] drm_ioctl+0x4d3/0x580 [drm]
[  168.515220]  [<ffffffff81166dc5>] ? mmap_region+0x1c5/0x590
[  168.515230]  [<ffffffffa003edc0>] ? drm_mode_setplane+0x3b0/0x3b0 [drm]
[  168.515234]  [<ffffffff811b1a97>] do_vfs_ioctl+0x97/0x580
[  168.515238]  [<ffffffff812a185a>] ? inode_has_perm.isra.32.constprop.62+0x2a/0x30
[  168.515240]  [<ffffffff812a2ee7>] ? file_has_perm+0x97/0xb0
[  168.515243]  [<ffffffff811b2011>] sys_ioctl+0x91/0xb0
[  168.515247]  [<ffffffff8166afd9>] system_call_fastpath+0x16/0x1b
[  168.515249] ---[ end trace 951f146548a30c98 ]---

I'll also attach the whole dmesg.
Comment 16 Pasi Kärkkäinen 2013-07-19 14:24:37 UTC
Created attachment 82692 [details]
dmesg-3.9.9-201.dbg01.fc18.x86_64-optimus_enabled-warn-traceback-after-trying-to-enable-dvi-monitor
Comment 17 Pasi Kärkkäinen 2013-07-19 14:30:02 UTC
Oh, one more interesting piece of information.. 

If I boot (power-on) the laptop *without* DVI cable connected, the nouveau connectors will show up in 'xrandr' output as DP-1, DP-2 and DP-3. 

Now if I boot (power-on) the laptop *with* the DVI cable connected, the nouveau connectors will *NOT* show up in 'xrandr' most of the time, but on the roughly one boot out of ten when they do show up in 'xrandr', they're actually called DisplayPort-0, DisplayPort-1 and DisplayPort-2.

I wonder why that difference.. and if it's related to the issue I'm having or not. I'm only able to hit the nouveau kernel crash when the connectors are called DisplayPort-X, it seems. When the connectors are called DP-X, I'm not able to crash the kernel.
Comment 18 Pasi Kärkkäinen 2013-07-21 10:13:56 UTC
I think I'll open a separate bug about the inconsistent/random output detection in xrandr and the changing connector names.

So let's focus on the kernel crash and troubleshooting that in this bug. 
Any comments about the WARNING traceback I posted?
Comment 19 Pasi Kärkkäinen 2013-08-13 18:23:20 UTC
Ok separate bug opened:

"nouveau inconsistent changing output connector names in xrandr":
https://bugs.freedesktop.org/show_bug.cgi?id=68075

So let's focus on the kernel crash on this bug.
I just verified the same kernel crash also happens on Linux 3.10.4.
Comment 20 Pasi Kärkkäinen 2013-08-13 20:32:50 UTC
I built custom Linux 3.10.6 kernel with the patch from this bug's attachments (linux-3.9.9-gpu-drm-nouveau-warn-on-node-pages.patch) applied to add WARN_ON() to nouveau_vma_getmap() and I'm still hitting the same issue when enabling an output on nouveau gpu, which would lead to a hard kernel crash without the patch:


[   54.324698] ------------[ cut here ]------------
[   54.324764] WARNING: at drivers/gpu/drm/nouveau/nouveau_bo.c:952 nouveau_vma_getmap.isra.11+0xc1/0xe0 [nouveau]()
[   54.324767] Modules linked in: fuse ebtable_nat ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rfcomm bnep snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support acpi_cpufreq mperf coretemp microcode i2c_i801 arc4 snd_hda_intel uvcvideo snd_hda_codec btusb videobuf2_vmalloc videobuf2_memops bluetooth videobuf2_core videodev snd_hwdep media snd_seq iwldvm snd_seq_device cdc_mbim mac80211
[   54.324820]  snd_pcm lpc_ich mfd_core snd_page_alloc cdc_ncm usbnet iwlwifi cdc_wdm mii cdc_acm cfg80211 snd_timer e1000e ptp mei_me mei thinkpad_acpi pps_core snd soundcore rfkill vhost_net tun macvtap macvlan kvm_intel kvm uinput dm_crypt crc32_pclmul crc32c_intel nouveau i915 ghash_clmulni_intel mxm_wmi ttm sdhci_pci i2c_algo_bit sdhci drm_kms_helper mmc_core drm wmi i2c_core video
[   54.324858] CPU: 0 PID: 1089 Comm: Xorg Not tainted 3.10.6-100.dbg01.fc18.x86_64 #1
[   54.324861] Hardware name: LENOVO 2349H2G/2349H2G, BIOS G1ET93WW (2.53 ) 03/08/2013
[   54.324863]  0000000000000009 ffff880316abd808 ffffffff81656006 ffff880316abd848
[   54.324869]  ffffffff8105d660 ffff880316abd858 ffff88032a772458 ffff8803063c0a00
[   54.324873]  ffff8803063c0a40 ffff880316abda58 ffff88032813c170 ffff880316abd858
[   54.324906] Call Trace:
[   54.324939]  [<ffffffff81656006>] dump_stack+0x19/0x1b
[   54.324950]  [<ffffffff8105d660>] warn_slowpath_common+0x70/0xa0
[   54.324957]  [<ffffffff8105d6aa>] warn_slowpath_null+0x1a/0x20
[   54.325002]  [<ffffffffa0229961>] nouveau_vma_getmap.isra.11+0xc1/0xe0 [nouveau]
[   54.325030]  [<ffffffffa01ad94a>] ? _nouveau_gpuobj_wr32+0x2a/0x30 [nouveau]
[   54.325068]  [<ffffffffa0229a06>] nouveau_bo_move_m2mf.isra.12+0x86/0x140 [nouveau]
[   54.325106]  [<ffffffffa01cf633>] ? nouveau_vm_map_at+0x153/0x1c0 [nouveau]
[   54.325146]  [<ffffffffa022a415>] nouveau_bo_move+0xa5/0x400 [nouveau]
[   54.325169]  [<ffffffffa0088e7d>] ttm_bo_handle_move_mem+0x25d/0x630 [ttm]
[   54.325185]  [<ffffffffa0089d90>] ? ttm_bo_mem_space+0x180/0x360 [ttm]
[   54.325200]  [<ffffffffa008a097>] ttm_bo_move_buffer+0x127/0x140 [ttm]
[   54.325224]  [<ffffffffa008a14a>] ttm_bo_validate+0x9a/0x110 [ttm]
[   54.325261]  [<ffffffffa022ac4c>] nouveau_bo_validate+0x1c/0x20 [nouveau]
[   54.325299]  [<ffffffffa022ae7b>] nouveau_bo_pin+0x9b/0x100 [nouveau]
[   54.325307]  [<ffffffff81301d64>] ? snprintf+0x34/0x40
[   54.325348]  [<ffffffffa024da95>] nv50_crtc_mode_set_base+0x55/0xf0 [nouveau]
[   54.325364]  [<ffffffffa006d31b>] drm_crtc_helper_set_config+0x79b/0xb60 [drm_kms_helper]
[   54.325396]  [<ffffffffa003640e>] drm_mode_set_config_internal+0x2e/0x60 [drm]
[   54.325420]  [<ffffffffa0038a0b>] drm_mode_setcrtc+0xfb/0x620 [drm]
[   54.325428]  [<ffffffff81658fcd>] ? mutex_lock+0x1d/0x50
[   54.325449]  [<ffffffffa0029479>] drm_ioctl+0x549/0x680 [drm]
[   54.325474]  [<ffffffffa0038910>] ? drm_mode_setplane+0x3b0/0x3b0 [drm]
[   54.325485]  [<ffffffff811af2a7>] do_vfs_ioctl+0x97/0x580
[   54.325494]  [<ffffffff81296daa>] ? inode_has_perm.isra.33.constprop.63+0x2a/0x30
[   54.325501]  [<ffffffff81298367>] ? file_has_perm+0x97/0xb0
[   54.325508]  [<ffffffff8119fca5>] ? __sb_end_write+0x35/0x70
[   54.325514]  [<ffffffff811af821>] SyS_ioctl+0x91/0xb0
[   54.325524]  [<ffffffff81664659>] system_call_fastpath+0x16/0x1b
[   54.325528] ---[ end trace d7d4b0036de28fa5 ]---
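
For comparing a trace like the one above against the 3.9.9 one, a small sketch that reduces a call trace to its reliable frames. Frames prefixed with `?` are addresses found on the stack that the unwinder could not confirm as part of the call chain, so they are skipped. The function name and the `"kernel"` placeholder for built-in symbols are my own choices.

```python
import re

# One call-trace line: "[<addr>] [? ]symbol+0xoff/0xlen [module]"
TRACE_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<guess>\?\s+)?(?P<sym>[\w.]+)"
    r"\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(?P<mod>\w+)\])?"
)

def reliable_frames(dmesg_text):
    """Return [(symbol, module)] for frames not marked with '?'."""
    frames = []
    for line in dmesg_text.splitlines():
        m = TRACE_RE.search(line)
        if m and not m.group("guess"):
            frames.append((m.group("sym"), m.group("mod") or "kernel"))
    return frames

# Four lines copied from the 3.10.6 trace above.
sample = """[   54.324957]  [<ffffffff8105d6aa>] warn_slowpath_null+0x1a/0x20
[   54.325002]  [<ffffffffa0229961>] nouveau_vma_getmap.isra.11+0xc1/0xe0 [nouveau]
[   54.325030]  [<ffffffffa01ad94a>] ? _nouveau_gpuobj_wr32+0x2a/0x30 [nouveau]
[   54.325068]  [<ffffffffa0229a06>] nouveau_bo_move_m2mf.isra.12+0x86/0x140 [nouveau]
"""
for sym, mod in reliable_frames(sample):
    print(f"{mod}: {sym}")
```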
Comment 21 Pasi Kärkkäinen 2013-08-21 16:56:19 UTC
Any comments about this issue? 

The patch (linux-3.9.9-gpu-drm-nouveau-warn-on-node-pages.patch) from comment #9 fixes the reproducible hard kernel crash for me. Should we push this to Linux, or should we dig deeper into why nouveau_vma_getmap() is called with wrong arguments, resulting in a null pointer crash?
Comment 22 Ilia Mirkin 2013-08-21 17:03:59 UTC
We should figure out why it's getting stuff that it doesn't expect. I had suggested adding the warn solely to avoid the crash down the line. I guess it's reasonable for the upstream kernel to have that, as if it ever fires, we're going to be in trouble later... I think. Will have to re-check the code. I can post it with my analysis if you like, or you can.

Could you retest this with 3.11-rc6 BTW? The whole fb pinning/etc logic was reworked, and your issue may have been fixed by it.
Comment 23 Pasi Kärkkäinen 2013-08-21 17:07:27 UTC
Feel free to post it with your analysis! You can add my reported-by and tested-by. 

I'm able to reproduce the crash with Linux 3.8.x, 3.9.x and 3.10.x, so the patch is needed at least on the 3.10.x stable branch to fix the crash there.

I'll also test with 3.11-rc.
Comment 24 Pasi Kärkkäinen 2013-10-01 16:34:25 UTC
Some comments here.. this bug is still being discussed on the mailing lists, and Ben Skeggs said he'll take a look at the kernel side fixes soon.

Also good to know: I'm able to reliably reproduce this kernel crash with Fedora 18 userspace/xorg, but not with Fedora 19 userspace/xorg.
Comment 25 Pawel Sikora 2014-01-02 00:05:05 UTC
fc20 crash: https://bugzilla.redhat.com/show_bug.cgi?id=1047169
Comment 26 Ilia Mirkin 2014-01-02 01:19:28 UTC
FWIW, my patch, which was deemed inappropriate for inclusion: http://marc.info/?l=dri-devel&m=137713025702333&w=2

Later in the thread, another patch was proposed: http://marc.info/?l=dri-devel&m=137715564208095&w=2

However I don't think it has made it upstream yet as Ben had some concerns.
Comment 27 Pawel Sikora 2014-01-06 16:17:35 UTC
(In reply to comment #26)
> FWIW, my patch, which was deemed inappropriate for inclusion:
> http://marc.info/?l=dri-devel&m=137713025702333&w=2
> 
> Later in the thread, another patch was proposed:
> http://marc.info/?l=dri-devel&m=137715564208095&w=2
> 
> However I don't think it has made it upstream yet as Ben had some concerns.

the latest patch mentioned in the lkml thread only changes hard system locks
into soft locks for kernel-3.12.6-300.fc20.x86_64. suspend/resume cycles
work pretty stably on rawhide kernel-3.13.0-0.rc7.git0.2.fc21.x86_64.

afaics, the 3.13 kernel contains some acpi/pm fixes in the nouveau area.

