Created attachment 79549 [details]
dmesg Linux 3.8.11 nouveau optimus before the kernel crash

I'm using a Lenovo T430 laptop with Intel + NVIDIA hybrid graphics. Optimus is enabled in the BIOS:

$ lspci | grep VGA
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
01:00.0 VGA compatible controller: NVIDIA Corporation GF108 [Quadro NVS 5400M] (rev a1)

$ uname -a
Linux localhost.localdomain 3.8.11-200.fc18.x86_64 #1 SMP Wed May 1 19:44:27 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

I boot the system with only the laptop's internal LVDS panel active on the Intel GPU. Once in the X session, I enable the second display, which is driven by the nouveau GPU and connected to the dock's DP2/DVI connector. At that point the kernel crashes immediately.

Screenshots of the kernel crash/stack trace are here:
http://pasik.reaktio.net/nouveau/debug/nouveau-kernel-crash01.jpg
http://pasik.reaktio.net/nouveau/debug/nouveau-kernel-crash02.jpg

Hand-copied bits from the kernel crash:

Pid: 1208, comm: Xorg Not tainted 3.8.11-200.fc18.x86_64 #1 LENOVO 2349H2G/2349H2G
RIP: 0010:[<ffffffffa01abb7e>] [<ffffffffa01abb7e>] nvc0_vm_map_sg+0x8e/0x110 [nouveau]
RSP: 0018:ffff8803298357c8 EFLAGS: 00010206
..
Process Xorg (pid: 1208,..)
..
Call Trace:
.. nouveau_vm_map_sg+0xc2/0x130 [nouveau]
.. nouveau_vma_getmap.isra.11+0x68/0xa0 [nouveau]
.. nouveau_bo_move_m2mf.isra.12+0x85/0x140 [nouveau]
.. ? nouveau_vm_map_at+0x153/0x1c0 [nouveau]
.. nouveau_bo_move+0xa5/0x400 [nouveau]
.. ttm_bo_handle_move_mem+0x245/0x610 [ttm]

More info is in the screenshots above (also attached). The kernel dmesg from before the crash is attached.

I also tried Linux kernel 3.9.2, and the system crashes hard there as well.
Created attachment 79550 [details] Screenshot 01 of the nouveau kernel crash with traceback
Created attachment 79551 [details] Screenshot 02 of the nouveau kernel crash with traceback
It would be nice to see the output of

gdb /lib/modules/..../nouveau.ko
disassemble nvc0_vm_map_sg

Looks like there's a null dereference in there (CR2 = 0). I don't think Optimus mode is supposed to be well supported right now, but we still shouldn't be oopsing the kernel.
Ok, here goes:

Reading symbols from /usr/lib/modules/3.8.11-200.fc18.x86_64/kernel/drivers/gpu/drm/nouveau/nouveau.ko...(no debugging symbols found)...done.
Missing separate debuginfos, use: debuginfo-install kernel-3.8.11-200.fc18.x86_64
(gdb) disassemble nvc0_vm_map_sg
Dump of assembler code for function nvc0_vm_map_sg:
0x0000000000023b20 <+0>: callq 0x23b25 <nvc0_vm_map_sg+5>
0x0000000000023b25 <+5>: push %rbp
0x0000000000023b26 <+6>: mov %rsp,%rbp
0x0000000000023b29 <+9>: push %r15
0x0000000000023b2b <+11>: push %r14
0x0000000000023b2d <+13>: push %r13
0x0000000000023b2f <+15>: mov %rsi,%r13
0x0000000000023b32 <+18>: push %r12
0x0000000000023b34 <+20>: push %rbx
0x0000000000023b35 <+21>: lea 0x0(,%rcx,8),%ebx
0x0000000000023b3c <+28>: sub $0x38,%rsp
0x0000000000023b40 <+32>: mov 0x30(%rdi),%esi
0x0000000000023b43 <+35>: mov %rdx,-0x40(%rbp)
0x0000000000023b47 <+39>: mov %rdi,-0x38(%rbp)
0x0000000000023b4b <+43>: mov %r9,-0x48(%rbp)
0x0000000000023b4f <+47>: lea -0x1(%r8),%edx
0x0000000000023b53 <+51>: mov %esi,%eax
0x0000000000023b55 <+53>: and $0x10,%eax
0x0000000000023b58 <+56>: cmp $0x1,%eax
0x0000000000023b5b <+59>: sbb %eax,%eax
0x0000000000023b5d <+61>: and $0xfffffffe,%eax
0x0000000000023b60 <+64>: add $0x7,%eax
Here's the whole function:

(gdb) disassemble nvc0_vm_map_sg
Dump of assembler code for function nvc0_vm_map_sg:
0x0000000000023b20 <+0>: callq 0x23b25 <nvc0_vm_map_sg+5>
0x0000000000023b25 <+5>: push %rbp
0x0000000000023b26 <+6>: mov %rsp,%rbp
0x0000000000023b29 <+9>: push %r15
0x0000000000023b2b <+11>: push %r14
0x0000000000023b2d <+13>: push %r13
0x0000000000023b2f <+15>: mov %rsi,%r13
0x0000000000023b32 <+18>: push %r12
0x0000000000023b34 <+20>: push %rbx
0x0000000000023b35 <+21>: lea 0x0(,%rcx,8),%ebx
0x0000000000023b3c <+28>: sub $0x38,%rsp
0x0000000000023b40 <+32>: mov 0x30(%rdi),%esi
0x0000000000023b43 <+35>: mov %rdx,-0x40(%rbp)
0x0000000000023b47 <+39>: mov %rdi,-0x38(%rbp)
0x0000000000023b4b <+43>: mov %r9,-0x48(%rbp)
0x0000000000023b4f <+47>: lea -0x1(%r8),%edx
0x0000000000023b53 <+51>: mov %esi,%eax
0x0000000000023b55 <+53>: and $0x10,%eax
0x0000000000023b58 <+56>: cmp $0x1,%eax
0x0000000000023b5b <+59>: sbb %eax,%eax
0x0000000000023b5d <+61>: and $0xfffffffe,%eax
0x0000000000023b60 <+64>: add $0x7,%eax
0x0000000000023b63 <+67>: test %r8d,%r8d
0x0000000000023b66 <+70>: je 0x23c16 <nvc0_vm_map_sg+246>
0x0000000000023b6c <+76>: mov %rax,%rcx
0x0000000000023b6f <+79>: lea 0x4(%rbx),%eax
0x0000000000023b72 <+82>: lea 0x8(,%rdx,8),%rdx
0x0000000000023b7a <+90>: xor %r15d,%r15d
0x0000000000023b7d <+93>: shl $0x20,%rcx
0x0000000000023b81 <+97>: mov %eax,-0x5c(%rbp)
0x0000000000023b84 <+100>: mov %r13,%rax
0x0000000000023b87 <+103>: mov %rcx,-0x50(%rbp)
0x0000000000023b8b <+107>: mov %r15,%r13
0x0000000000023b8e <+110>: mov %rdx,-0x58(%rbp)
0x0000000000023b92 <+114>: mov %rax,%r15
0x0000000000023b95 <+117>: jmp 0x23ba7 <nvc0_vm_map_sg+135>
0x0000000000023b97 <+119>: nopw 0x0(%rax,%rax,1)
0x0000000000023ba0 <+128>: mov -0x38(%rbp),%rdx
0x0000000000023ba4 <+132>: mov 0x30(%rdx),%esi
0x0000000000023ba7 <+135>: mov -0x48(%rbp),%rcx
0x0000000000023bab <+139>: mov %r15,%rdi
0x0000000000023bae <+142>: mov (%rcx,%r13,1),%rax
0x0000000000023bb2 <+146>: shr $0x8,%rax
0x0000000000023bb6 <+150>: mov %rax,%rdx
0x0000000000023bb9 <+153>: or $0x3,%rax
0x0000000000023bbd <+157>: or $0x1,%rdx
0x0000000000023bc1 <+161>: and $0x4,%esi
0x0000000000023bc4 <+164>: lea 0x0(%r13,%rbx,1),%esi
0x0000000000023bc9 <+169>: cmovne %rax,%rdx
0x0000000000023bcd <+173>: mov -0x40(%rbp),%rax
0x0000000000023bd1 <+177>: mov 0xd8(%rax),%r14d
0x0000000000023bd8 <+184>: shl $0x24,%r14
0x0000000000023bdc <+188>: or -0x50(%rbp),%r14
0x0000000000023be0 <+192>: or %rdx,%r14
0x0000000000023be3 <+195>: mov (%r15),%rdx
0x0000000000023be6 <+198>: mov 0x8(%rdx),%r10
0x0000000000023bea <+202>: mov %r14d,%edx
0x0000000000023bed <+205>: callq *0x48(%r10)
0x0000000000023bf1 <+209>: mov (%r15),%rdx
0x0000000000023bf4 <+212>: mov -0x5c(%rbp),%esi
0x0000000000023bf7 <+215>: mov %r15,%rdi
0x0000000000023bfa <+218>: mov 0x8(%rdx),%r10
0x0000000000023bfe <+222>: mov %r14,%rdx
0x0000000000023c01 <+225>: add %r13d,%esi
0x0000000000023c04 <+228>: shr $0x20,%rdx
0x0000000000023c08 <+232>: add $0x8,%r13
0x0000000000023c0c <+236>: callq *0x48(%r10)
0x0000000000023c10 <+240>: cmp -0x58(%rbp),%r13
0x0000000000023c14 <+244>: jne 0x23ba0 <nvc0_vm_map_sg+128>
0x0000000000023c16 <+246>: add $0x38,%rsp
0x0000000000023c1a <+250>: pop %rbx
0x0000000000023c1b <+251>: pop %r12
0x0000000000023c1d <+253>: pop %r13
0x0000000000023c1f <+255>: pop %r14
0x0000000000023c21 <+257>: pop %r15
0x0000000000023c23 <+259>: pop %rbp
0x0000000000023c24 <+260>: retq
End of assembler dump.
(gdb)
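The RIP in the crash report is given as nvc0_vm_map_sg+0x8e (hex), while gdb labels each instruction with a decimal offset. As a rough sketch, the lookup can be automated on a saved dump. This assumes gdb's usual one-instruction-per-line output; the helper name find_instruction is made up for illustration.

```python
import re

# Matches gdb lines such as:
#   0x0000000000023bae <+142>: mov (%rcx,%r13,1),%rax
# An optional "=>" current-instruction marker is tolerated.
LINE_RE = re.compile(r"^\s*(?:=>\s*)?0x[0-9a-f]+ <\+(\d+)>:\s*(.+?)\s*$")


def find_instruction(disassembly, offset_hex):
    """Return (decimal_offset, instruction) for a hex offset like '0x8e'.

    The instruction is None when no line in the dump starts at that offset.
    """
    offset = int(offset_hex, 16)
    for line in disassembly.splitlines():
        match = LINE_RE.match(line)
        if match and int(match.group(1)) == offset:
            return offset, match.group(2)
    return offset, None
```

Called with the dump above and "0x8e", this should point at the `mov (%rcx,%r13,1),%rax` load at <+142>.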
Well, +0x8e is +142, and we see

0x0000000000023bae <+142>: mov (%rcx,%r13,1),%rax
0x0000000000023bb2 <+146>: shr $0x8,%rax
0x0000000000023bb6 <+150>: mov %rax,%rdx
0x0000000000023bb9 <+153>: or $0x3,%rax
0x0000000000023bbd <+157>: or $0x1,%rdx

which I'm fairly sure corresponds to

u64 phys = nvc0_vm_addr(vma, *list++, memtype, target);

since

static inline u64
nvc0_vm_addr(struct nouveau_vma *vma, u64 phys, u32 memtype, u32 target)
{
        phys >>= 8;
        phys |= 0x00000001; /* present */
        if (vma->access & NV_MEM_ACCESS_SYS)
                phys |= 0x00000002;

(For some reason the compiler computes the two branches in two separate registers... odd, but nothing else in the code matches up as nicely.) So the list pointer that was passed in must be NULL.

This corresponds to drivers/gpu/drm/nouveau/nouveau_bo.c:nouveau_vma_getmap, which passes mem->mm_node as the mem argument to vm_map_sg, which in turn dereferences mem->pages. So perhaps add something like this to the top of nouveau_vma_getmap (before the vm_get call):

if (WARN_ON(!node->pages)) {
        return -EINVAL;
}

That should avoid the crash, but it will not provide any additional functionality. You should then see a backtrace, but no crash.
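To make the "two separate registers" remark concrete, here is a small model of only the quoted lines of nvc0_vm_addr next to what the disassembly appears to do (both variants computed up front, then one selected by cmovne). It is a sketch, not driver code: the value 0x4 for NV_MEM_ACCESS_SYS is an assumption read off the `and $0x4,%esi` instruction, and both function names are invented.

```python
NV_MEM_ACCESS_SYS = 0x4  # assumption: taken from `and $0x4,%esi` in the dump


def vm_addr_as_written(phys, access):
    """Mirror the quoted C lines: shift, set 'present', optionally set bit 1."""
    phys >>= 8
    phys |= 0x00000001  # present
    if access & NV_MEM_ACCESS_SYS:
        phys |= 0x00000002
    return phys


def vm_addr_as_compiled(phys, access):
    """Mirror the disassembly: compute both results, then pick one."""
    shifted = phys >> 8        # shr $0x8,%rax
    with_sys = shifted | 0x3   # or $0x3,%rax
    without = shifted | 0x1    # or $0x1,%rdx
    # and $0x4,%esi ; cmovne %rax,%rdx
    return with_sys if access & NV_MEM_ACCESS_SYS else without
```

Both functions return the same value for every input, which fits the reading that the load at <+142> is the `*list++` dereference feeding this computation.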
Actually, that should probably be WARN_ON(mem->mem_type != TTM_PL_VRAM && !node->pages) otherwise you'll get a lot of spurious warns/failures.
I tested the most recent Fedora 18 kernel update (3.9.9-201.fc18.x86_64), and the bug is still there: I enabled Optimus in the BIOS, booted to Linux, tried to enable the nouveau DVI output, and the kernel crashed immediately. Then I built a custom kernel based on 3.9.9-201.fc18.x86_64 with the attached patch included (as suggested above) and booted with that. Now there's a different problem: the nouveau outputs are not detected and cannot be enabled, so I can't tell whether the patch fixes or works around the crash. At least it breaks something :) There are no obvious nouveau-related errors in the dmesg, which is attached as well.
Created attachment 82633 [details] [review] linux-3.9.9-gpu-drm-nouveau-warn-on-node-pages.patch
Created attachment 82635 [details] dmesg-3.9.9-201.dbg01-with-custom_warn-on-pages_patch-optimus_enabled-nouveau_does_not_work.txt
That last dmesg looks fine... what's the problem? (It seems unlikely that merely adding a WARN_ON that never fires would fix things; that would point to a compiler issue, or a different config/compilation mechanism/??)

You now have both an Intel and an NVIDIA device. I think it starts out on the Intel device; you can use the regular vga_switcheroo mechanisms to switch to the NVIDIA device. Some docs about it are available at https://help.ubuntu.com/community/HybridGraphics.

If you're aware of all that, can you be more specific about what you're doing and what exactly doesn't work?
Ok, sorry for not explaining properly.

With the stock Fedora 3.9.9-201 kernel the nouveau DVI outputs are detected OK, so in the Mate desktop display properties application I can see the external monitor that is connected via DVI, and I can choose to enable it, which causes the kernel crash.

With the custom 3.9.9-201.dbg01 kernel that I built, which only has the WARN_ON patch added on top of the stock kernel, the nouveau outputs are *not* detected at all, so I can't see the DVI monitor in the display properties application and I can't enable it.

I tried rebooting the laptop multiple times, and the behaviour stays the same. Maybe I should play with xrandr as well, and probably build another kernel with the patch removed to see whether it really causes the issue.

Thanks for your reply!
Outputs certainly *appear* to be detected. What makes you say they're not?

[ 2.882654] nouveau [ DRM] DCB outp 00: 01800323 00010034
[ 2.882656] nouveau [ DRM] DCB outp 01: 02811300 00000000
[ 2.882657] nouveau [ DRM] DCB outp 02: 028223a6 0f220010
[ 2.882658] nouveau [ DRM] DCB outp 03: 02822362 00020010
[ 2.882659] nouveau [ DRM] DCB outp 04: 048333b6 0f220010
[ 2.882660] nouveau [ DRM] DCB outp 05: 04833372 00020010
[ 2.882662] nouveau [ DRM] DCB outp 06: 088443c6 0f220010
[ 2.882663] nouveau [ DRM] DCB outp 07: 08844382 00020010
[ 2.882664] nouveau [ DRM] DCB conn 00: 00000040
[ 2.882666] nouveau [ DRM] DCB conn 01: 00000100
[ 2.882667] nouveau [ DRM] DCB conn 02: 00110246
[ 2.882668] nouveau [ DRM] DCB conn 03: 00220346
[ 2.882669] nouveau [ DRM] DCB conn 04: 01400446
[ 3.023865] nouveau [ DRM] allocated 1920x1080 fb: 0x60000, bo ffff8803285f9c00

I assume you have a 1920x1080 screen hooked up to the DVI?

You can look at the KMS status in /sys/class/drm/ -- you should see cardN-DVI-I-N directories. xrandr should also provide accurate status, e.g.:

$ grep . /sys/class/drm/card*-*/status
/sys/class/drm/card0-DVI-I-1/status:connected
/sys/class/drm/card0-DVI-I-2/status:disconnected

However, it may end up being hidden due to vga_switcheroo -- so make sure to look into that too. (Unfortunately I know next to nothing about vga_switcheroo, but I'm sure Google knows more.)
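For repeated checks across reboots, the same sysfs lookup as the grep above can be scripted. A minimal sketch, assuming the /sys/class/drm/cardN-CONNECTOR/status layout shown in that example; the function name connector_status is made up here.

```python
from pathlib import Path


def connector_status(drm_root="/sys/class/drm"):
    """Map connector directory names (e.g. 'card0-DVI-I-1') to their status.

    Reads the same files as: grep . /sys/class/drm/card*-*/status
    """
    result = {}
    for status_file in sorted(Path(drm_root).glob("card*-*/status")):
        result[status_file.parent.name] = status_file.read_text().strip()
    return result
```

Printing `connector_status()` after each boot gives a quick record to compare against what xrandr reports in the same session.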
Ok, after a lot of reboots and trying different kernels I realized the issue is not related to the WARN_ON patch I added. The difference depends on whether or not the DVI monitor is connected when the system is powered on. So what happens is:

1) I boot to Linux with the DVI monitor connected to nouveau before powering on the system. This is how it looks then:

$ grep . /sys/class/drm/card*-*/status
/sys/class/drm/card0-LVDS-1/status:connected
/sys/class/drm/card0-VGA-1/status:disconnected
/sys/class/drm/card1-DP-1/status:disconnected
/sys/class/drm/card1-DP-2/status:connected
/sys/class/drm/card1-DP-3/status:disconnected
/sys/class/drm/card1-LVDS-2/status:disconnected
/sys/class/drm/card1-VGA-2/status:disconnected

DP-2 is the external DVI monitor. So it's listed as "connected", but xrandr doesn't see it:

$ xrandr
Screen 0: minimum 320 x 200, current 1600 x 900, maximum 8192 x 8192
LVDS1 connected 1600x900+0+0 (normal left inverted right x axis y axis) 309mm x 174mm
   1600x900 60.0*+ 40.0
   1024x768 60.0
   800x600 60.3 56.2
   640x480 59.9
VGA1 disconnected (normal left inverted right x axis y axis)

There is nothing about the nouveau adapter in the xrandr output.

2) I boot to Linux with the DVI monitor NOT connected to nouveau. I connect the DVI cable after the system has booted to Xorg. This is how it looks then:

$ grep . /sys/class/drm/card*-*/status
/sys/class/drm/card0-LVDS-1/status:connected
/sys/class/drm/card0-VGA-1/status:disconnected
/sys/class/drm/card1-DP-1/status:disconnected
/sys/class/drm/card1-DP-2/status:connected
/sys/class/drm/card1-DP-3/status:disconnected
/sys/class/drm/card1-LVDS-2/status:disconnected
/sys/class/drm/card1-VGA-2/status:disconnected

So again DP-2 is shown as connected to the DVI monitor, which is true.

$ xrandr
Screen 0: minimum 320 x 200, current 1600 x 900, maximum 8192 x 8192
LVDS1 connected 1600x900+0+0 (normal left inverted right x axis y axis) 309mm x 174mm
   1600x900 60.0*+ 40.0
   1024x768 60.0
   800x600 60.3 56.2
   640x480 59.9
VGA1 disconnected (normal left inverted right x axis y axis)
LVDS-2 disconnected (normal left inverted right x axis y axis)
VGA-2 disconnected (normal left inverted right x axis y axis)
DP-1 disconnected (normal left inverted right x axis y axis)
DP-2 connected (normal left inverted right x axis y axis)
   1920x1080 59.9 +
   1600x1200 60.0
   1680x1050 59.9
   1280x1024 60.0
   1440x900 59.9
   1280x960 60.0
   1280x800 59.9
   1024x768 60.0
   800x600 60.3 56.2
   640x480 60.0
DP-3 disconnected (normal left inverted right x axis y axis)
  1024x768 (0x8e) 65.0MHz
        h: width 1024 start 1048 end 1184 total 1344 skew 0 clock 48.4KHz
        v: height 768 start 771 end 777 total 806 clock 60.0Hz
  800x600 (0x8f) 40.0MHz
        h: width 800 start 840 end 968 total 1056 skew 0 clock 37.9KHz
        v: height 600 start 601 end 605 total 628 clock 60.3Hz
  800x600 (0x90) 36.0MHz
        h: width 800 start 824 end 896 total 1024 skew 0 clock 35.2KHz
        v: height 600 start 601 end 603 total 625 clock 56.2Hz

And now xrandr can see the DVI monitor as well.

So when I plug the DVI cable into the nouveau adapter on the fly, the external display is actually detected, and can be enabled and used (with artifacts/corruption, but that is a separate bug ;)

So 'xrandr' shows the same problem I described earlier with the Mate desktop display-properties GUI application. When the laptop is booted with the DVI cable plugged in from the start, the mate-display-properties GUI doesn't allow me to enable the DVI monitor, so its behaviour matches xrandr. I assume mate-display-properties uses xrandr internally.

But this really sounds like another bug.
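Comparing the two xrandr captures by eye is error-prone, so here is a small sketch that pulls the connector lines out of saved `xrandr` output and lists what one capture has that the other lacks. It only understands the "NAME connected/disconnected" lines seen above; the function names are invented for this example.

```python
import re

# Connector lines start at column 0, e.g. "DP-2 connected (normal left ...".
# Mode lines are indented and the "Screen 0:" line has no state word.
CONNECTOR_RE = re.compile(r"^(\S+) (connected|disconnected)\b", re.M)


def xrandr_connectors(output):
    """Return {connector_name: state} parsed from plain `xrandr` output."""
    return {name: state for name, state in CONNECTOR_RE.findall(output)}


def new_connectors(before, after):
    """Connector names present in `after` but missing from `before`."""
    return sorted(set(xrandr_connectors(after)) - set(xrandr_connectors(before)))
```

Fed the output from case 1 as `before` and case 2 as `after`, this would list the nouveau connectors (LVDS-2, VGA-2, DP-1, DP-2, DP-3) that only show up after hot-plugging.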
xrandr not working when booted with the DVI cable connected doesn't seem to happen *always*; it seems to be a bit random. In roughly one boot out of 10 I'm able to see the nouveau connectors in xrandr, so I'm able to try to enable them, and thus hit the WARN_ON in the patch I added to the kernel. So here goes:

[ 168.514913] ------------[ cut here ]------------
[ 168.514954] WARNING: at drivers/gpu/drm/nouveau/nouveau_bo.c:952 nouveau_vma_getmap.isra.11+0xc1/0xe0 [nouveau]()
[ 168.514956] Hardware name: 2349H2G
[ 168.514957] Modules linked in: fuse ebtable_nat ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack be2iscsi iscsi_boot_sysfs bnx2i ebtable_filter cnic ebtables uio ip6table_filter cxgb4i cxgb4 ip6_tables cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rfcomm bnep snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support arc4 iwldvm acpi_cpufreq mac80211 mperf coretemp uvcvideo microcode snd_hda_intel videobuf2_vmalloc snd_hda_codec i2c_i801 videobuf2_memops snd_hwdep videobuf2_core snd_seq iwlwifi snd_seq_device videodev cdc_mbim btusb snd_pcm
[ 168.514987] media bluetooth cfg80211 cdc_ncm usbnet cdc_wdm cdc_acm mii lpc_ich e1000e mfd_core snd_page_alloc snd_timer mei ptp thinkpad_acpi pps_core snd soundcore rfkill vhost_net tun macvtap macvlan kvm_intel kvm uinput dm_crypt crc32_pclmul crc32c_intel nouveau i915 ghash_clmulni_intel mxm_wmi sdhci_pci i2c_algo_bit sdhci ttm drm_kms_helper mmc_core drm i2c_core wmi video
[ 168.515008] Pid: 914, comm: Xorg Not tainted 3.9.9-201.dbg01.fc18.x86_64 #1
[ 168.515009] Call Trace:
[ 168.515017] [<ffffffff8105efc5>] warn_slowpath_common+0x75/0xa0
[ 168.515019] [<ffffffff8105f00a>] warn_slowpath_null+0x1a/0x20
[ 168.515036] [<ffffffffa020d801>] nouveau_vma_getmap.isra.11+0xc1/0xe0 [nouveau]
[ 168.515046] [<ffffffffa019292a>] ? _nouveau_gpuobj_wr32+0x2a/0x30 [nouveau]
[ 168.515062] [<ffffffffa020d8a6>] nouveau_bo_move_m2mf.isra.12+0x86/0x140 [nouveau]
[ 168.515077] [<ffffffffa01b6b23>] ? nouveau_vm_map_at+0x153/0x1c0 [nouveau]
[ 168.515092] [<ffffffffa020e2aa>] nouveau_bo_move+0x9a/0x400 [nouveau]
[ 168.515100] [<ffffffffa007ee15>] ttm_bo_handle_move_mem+0x245/0x610 [ttm]
[ 168.515106] [<ffffffffa007fd00>] ? ttm_bo_mem_space+0x180/0x360 [ttm]
[ 168.515112] [<ffffffffa007fff7>] ttm_bo_move_buffer+0x117/0x130 [ttm]
[ 168.515116] [<ffffffff8122b1fa>] ? ext4_dirty_inode+0x5a/0x70
[ 168.515125] [<ffffffffa00800aa>] ttm_bo_validate+0x9a/0x110 [ttm]
[ 168.515140] [<ffffffffa020eb1c>] nouveau_bo_validate+0x1c/0x20 [nouveau]
[ 168.515155] [<ffffffffa020ed4b>] nouveau_bo_pin+0x9b/0x100 [nouveau]
[ 168.515158] [<ffffffff8130c6b4>] ? snprintf+0x34/0x40
[ 168.515173] [<ffffffffa0231b25>] nv50_crtc_mode_set_base+0x55/0xf0 [nouveau]
[ 168.515180] [<ffffffffa007320b>] drm_crtc_helper_set_config+0x77b/0xb30 [drm_kms_helper]
[ 168.515194] [<ffffffffa003c76e>] drm_mode_set_config_internal+0x2e/0x60 [drm]
[ 168.515204] [<ffffffffa003eecc>] drm_mode_setcrtc+0x10c/0x570 [drm]
[ 168.515208] [<ffffffff8165f88d>] ? mutex_lock+0x1d/0x50
[ 168.515217] [<ffffffffa002f483>] drm_ioctl+0x4d3/0x580 [drm]
[ 168.515220] [<ffffffff81166dc5>] ? mmap_region+0x1c5/0x590
[ 168.515230] [<ffffffffa003edc0>] ? drm_mode_setplane+0x3b0/0x3b0 [drm]
[ 168.515234] [<ffffffff811b1a97>] do_vfs_ioctl+0x97/0x580
[ 168.515238] [<ffffffff812a185a>] ? inode_has_perm.isra.32.constprop.62+0x2a/0x30
[ 168.515240] [<ffffffff812a2ee7>] ? file_has_perm+0x97/0xb0
[ 168.515243] [<ffffffff811b2011>] sys_ioctl+0x91/0xb0
[ 168.515247] [<ffffffff8166afd9>] system_call_fastpath+0x16/0x1b
[ 168.515249] ---[ end trace 951f146548a30c98 ]---

I'll also attach the whole dmesg.
Created attachment 82692 [details] dmesg-3.9.9-201.dbg01.fc18.x86_64-optimus_enabled-warn-traceback-after-trying-to-enable-dvi-monitor
Oh, one more interesting piece of information.

If I boot (power on) the laptop *without* the DVI cable connected, the nouveau connectors show up in 'xrandr' output as DP-1, DP-2 and DP-3.

If I boot (power on) the laptop *with* the DVI cable connected, the nouveau connectors do *NOT* show up in 'xrandr' most of the time, but in the roughly one boot out of ten when they do show up, they're called DisplayPort-0, DisplayPort-1 and DisplayPort-2.

I wonder why there is that difference, and whether it's related to the issue I'm having. It seems I'm only able to hit the nouveau kernel crash when the connectors are called DisplayPort-X. When the connectors are called DP-X, I'm not able to crash the kernel.
I think I'll open a separate bug about the inconsistent/random output detection in xrandr and the changing connector names. So let's focus on the kernel crash and troubleshooting that in this bug. Any comments about the WARNING traceback I posted?
Ok separate bug opened: "nouveau inconsistent changing output connector names in xrandr": https://bugs.freedesktop.org/show_bug.cgi?id=68075 So let's focus on the kernel crash on this bug. I just verified the same kernel crash also happens on Linux 3.10.4.
I built a custom Linux 3.10.6 kernel with the patch from this bug's attachments (linux-3.9.9-gpu-drm-nouveau-warn-on-node-pages.patch) applied, which adds WARN_ON() to nouveau_vma_getmap(). I'm still hitting the same issue when enabling an output on the nouveau GPU, which would lead to a hard kernel crash without the patch:

[ 54.324698] ------------[ cut here ]------------
[ 54.324764] WARNING: at drivers/gpu/drm/nouveau/nouveau_bo.c:952 nouveau_vma_getmap.isra.11+0xc1/0xe0 [nouveau]()
[ 54.324767] Modules linked in: fuse ebtable_nat ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi rfcomm bnep snd_hda_codec_realtek iTCO_wdt iTCO_vendor_support acpi_cpufreq mperf coretemp microcode i2c_i801 arc4 snd_hda_intel uvcvideo snd_hda_codec btusb videobuf2_vmalloc videobuf2_memops bluetooth videobuf2_core videodev snd_hwdep media snd_seq iwldvm snd_seq_device cdc_mbim mac80211
[ 54.324820] snd_pcm lpc_ich mfd_core snd_page_alloc cdc_ncm usbnet iwlwifi cdc_wdm mii cdc_acm cfg80211 snd_timer e1000e ptp mei_me mei thinkpad_acpi pps_core snd soundcore rfkill vhost_net tun macvtap macvlan kvm_intel kvm uinput dm_crypt crc32_pclmul crc32c_intel nouveau i915 ghash_clmulni_intel mxm_wmi ttm sdhci_pci i2c_algo_bit sdhci drm_kms_helper mmc_core drm wmi i2c_core video
[ 54.324858] CPU: 0 PID: 1089 Comm: Xorg Not tainted 3.10.6-100.dbg01.fc18.x86_64 #1
[ 54.324861] Hardware name: LENOVO 2349H2G/2349H2G, BIOS G1ET93WW (2.53 ) 03/08/2013
[ 54.324863] 0000000000000009 ffff880316abd808 ffffffff81656006 ffff880316abd848
[ 54.324869] ffffffff8105d660 ffff880316abd858 ffff88032a772458 ffff8803063c0a00
[ 54.324873] ffff8803063c0a40 ffff880316abda58 ffff88032813c170 ffff880316abd858
[ 54.324906] Call Trace:
[ 54.324939] [<ffffffff81656006>] dump_stack+0x19/0x1b
[ 54.324950] [<ffffffff8105d660>] warn_slowpath_common+0x70/0xa0
[ 54.324957] [<ffffffff8105d6aa>] warn_slowpath_null+0x1a/0x20
[ 54.325002] [<ffffffffa0229961>] nouveau_vma_getmap.isra.11+0xc1/0xe0 [nouveau]
[ 54.325030] [<ffffffffa01ad94a>] ? _nouveau_gpuobj_wr32+0x2a/0x30 [nouveau]
[ 54.325068] [<ffffffffa0229a06>] nouveau_bo_move_m2mf.isra.12+0x86/0x140 [nouveau]
[ 54.325106] [<ffffffffa01cf633>] ? nouveau_vm_map_at+0x153/0x1c0 [nouveau]
[ 54.325146] [<ffffffffa022a415>] nouveau_bo_move+0xa5/0x400 [nouveau]
[ 54.325169] [<ffffffffa0088e7d>] ttm_bo_handle_move_mem+0x25d/0x630 [ttm]
[ 54.325185] [<ffffffffa0089d90>] ? ttm_bo_mem_space+0x180/0x360 [ttm]
[ 54.325200] [<ffffffffa008a097>] ttm_bo_move_buffer+0x127/0x140 [ttm]
[ 54.325224] [<ffffffffa008a14a>] ttm_bo_validate+0x9a/0x110 [ttm]
[ 54.325261] [<ffffffffa022ac4c>] nouveau_bo_validate+0x1c/0x20 [nouveau]
[ 54.325299] [<ffffffffa022ae7b>] nouveau_bo_pin+0x9b/0x100 [nouveau]
[ 54.325307] [<ffffffff81301d64>] ? snprintf+0x34/0x40
[ 54.325348] [<ffffffffa024da95>] nv50_crtc_mode_set_base+0x55/0xf0 [nouveau]
[ 54.325364] [<ffffffffa006d31b>] drm_crtc_helper_set_config+0x79b/0xb60 [drm_kms_helper]
[ 54.325396] [<ffffffffa003640e>] drm_mode_set_config_internal+0x2e/0x60 [drm]
[ 54.325420] [<ffffffffa0038a0b>] drm_mode_setcrtc+0xfb/0x620 [drm]
[ 54.325428] [<ffffffff81658fcd>] ? mutex_lock+0x1d/0x50
[ 54.325449] [<ffffffffa0029479>] drm_ioctl+0x549/0x680 [drm]
[ 54.325474] [<ffffffffa0038910>] ? drm_mode_setplane+0x3b0/0x3b0 [drm]
[ 54.325485] [<ffffffff811af2a7>] do_vfs_ioctl+0x97/0x580
[ 54.325494] [<ffffffff81296daa>] ? inode_has_perm.isra.33.constprop.63+0x2a/0x30
[ 54.325501] [<ffffffff81298367>] ? file_has_perm+0x97/0xb0
[ 54.325508] [<ffffffff8119fca5>] ? __sb_end_write+0x35/0x70
[ 54.325514] [<ffffffff811af821>] SyS_ioctl+0x91/0xb0
[ 54.325524] [<ffffffff81664659>] system_call_fastpath+0x16/0x1b
[ 54.325528] ---[ end trace d7d4b0036de28fa5 ]---
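To compare this trace with the one from 3.9.9, or with the hand-copied frames from the original oops, it helps to reduce each trace to a list of (symbol, module) pairs. The sketch below assumes the `[<address>] symbol+0xoff/0xsize [module]` frame layout seen in these logs. Frames the kernel prefixes with "?" are ones the unwinder found on the stack but could not confirm as part of the call chain, so they are skipped by default. The name trace_frames is hypothetical.

```python
import re

# One stack frame, e.g.:
#   [<ffffffffa0229961>] nouveau_vma_getmap.isra.11+0xc1/0xe0 [nouveau]
#   [<ffffffffa01ad94a>] ? _nouveau_gpuobj_wr32+0x2a/0x30 [nouveau]
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)


def trace_frames(log, include_unreliable=False):
    """Extract (symbol, module) pairs from a kernel call trace.

    Frames without a trailing [module] tag are reported as 'kernel'.
    """
    frames = []
    for unreliable, symbol, module in FRAME_RE.findall(log):
        if unreliable and not include_unreliable:
            continue
        frames.append((symbol, module or "kernel"))
    return frames
```

Running it over both WARNING traces makes it easy to confirm that they reach nouveau_vma_getmap through the same nv50_crtc_mode_set_base, nouveau_bo_pin, nouveau_bo_move path.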
Any comments about this issue? The patch (linux-3.9.9-gpu-drm-nouveau-warn-on-node-pages.patch) from comment #9 fixes the reproducible hard kernel crash for me. Should we push this to Linux, or should we dig deeper into why nouveau_vma_getmap() is called with wrong arguments, resulting in the null pointer crash?
We should figure out why it's getting stuff that it doesn't expect. I had suggested adding the warn solely to avoid the crash down the line. I guess it's reasonable for the upstream kernel to have that, as if it ever fires, we're going to be in trouble later... I think. Will have to re-check the code. I can post it with my analysis if you like, or you can. Could you retest this with 3.11-rc6 BTW? The whole fb pinning/etc logic was reworked, and your issue may have been fixed by it.
Feel free to post it with your analysis! You can add my reported-by and tested-by. I'm able to reproduce the crash with Linux 3.8.x, 3.9.x and 3.10.x, so the patch is needed at least on the 3.10.x stable branch to fix the crash there. I'll also test with 3.11-rc.
Some comments here.. this bug is still being discussed on the mailing lists, and Ben Skeggs said he'll take a look at the kernel side fixes soon. Also good to know: I'm able to reliably reproduce this kernel crash with Fedora 18 userspace/xorg, but not with Fedora 19 userspace/xorg.
fc20 crash: https://bugzilla.redhat.com/show_bug.cgi?id=1047169
FWIW, my patch, which was deemed inappropriate for inclusion: http://marc.info/?l=dri-devel&m=137713025702333&w=2 Later in the thread, another patch was proposed: http://marc.info/?l=dri-devel&m=137715564208095&w=2 However I don't think it has made it upstream yet as Ben had some concerns.
(In reply to comment #26)
> FWIW, my patch, which was deemed inappropriate for inclusion:
> http://marc.info/?l=dri-devel&m=137713025702333&w=2
>
> Later in the thread, another patch was proposed:
> http://marc.info/?l=dri-devel&m=137715564208095&w=2
>
> However I don't think it has made it upstream yet as Ben had some concerns.

The latest patch mentioned in the lkml thread only turns hard system lockups into soft lockups on kernel-3.12.6-300.fc20.x86_64. Suspend/resume cycles are pretty stable on the rawhide kernel-3.13.0-0.rc7.git0.2.fc21.x86_64. AFAICS, the 3.13 kernel contains some ACPI/PM fixes in the nouveau area.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/44.