Bug 70388

Summary: [NV34] failed to idle channel 0xcccc0000 (also NV44A)
Product: Mesa Reporter: Rostislav Krasny <rosti.bsd>
Component: Drivers/DRI/nouveauAssignee: Nouveau Project <nouveau>
Status: RESOLVED MOVED QA Contact:
Severity: critical    
Priority: high CC: auxsvr, fdsfgs, linasvepstas, lukas, rosti.bsd, sergstepanenko
Version: 9.2   
Hardware: x86 (IA32)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments: full dmesg
second failure
Full dmesg output after the crash
journalctl (same to /var/log/messages) chunk, related to the crash
syslog showing nouveau +X struggling to come up, lots of errors
lspci -vnn for above (and for successful dmesg below)
dmesg for successful boot, using vram_pushbuf=1

Description Rostislav Krasny 2013-10-11 19:17:31 UTC
I have NV34 FX5500 based AGP videocard. It works properly with old (2D only?) WM and DE like xfce, lxde and MATE. But with GNOME 3.8, Cinnamon and KDE 4.11 (that use 3D capabilities?) it fails (after 30 seconds upto 2 minutes) with the "failed to idle channel 0xcccc0000" in the logs. Tested with kernel 3.10 and 3.11 on Fedora 19 and Arch Linux. Also, se following dmesg logs made in Fedora:

[rosti@localhost ~]$ dmesg | grep nouve
[    1.340639] nouveau  [  DEVICE][0000:01:00.0] BOOT0  : 0x034600b1
[    1.340649] nouveau  [  DEVICE][0000:01:00.0] Chipset: NV34 (NV34)
[    1.340654] nouveau  [  DEVICE][0000:01:00.0] Family : NV30
[    1.342359] nouveau  [   VBIOS][0000:01:00.0] checking PRAMIN for image...
[    1.386401] nouveau  [   VBIOS][0000:01:00.0] ... appears to be valid
[    1.386407] nouveau  [   VBIOS][0000:01:00.0] using image from PRAMIN
[    1.386411] nouveau  [   VBIOS][0000:01:00.0] BMP version 5.28
[    1.386519] nouveau  [   VBIOS][0000:01:00.0] version 04.34.20.69.00
[    1.386757] nouveau W[  PTIMER][0000:01:00.0] unknown input clock freq
[    1.386770] nouveau  [     PFB][0000:01:00.0] RAM type: DDR1
[    1.386776] nouveau  [     PFB][0000:01:00.0] RAM size: 256 MiB
[    1.386781] nouveau  [     PFB][0000:01:00.0]    ZCOMP: 0 tags
[    1.390494] nouveau 0000:01:00.0: putting AGP V3 device into 8x mode
[    1.390625] nouveau  [     DRM] VRAM: 255 MiB
[    1.390629] nouveau  [     DRM] GART: 128 MiB
[    1.390632] nouveau  [     DRM] BMP version 5.40
[    1.390637] nouveau  [     DRM] DCB version 2.2
[    1.390642] nouveau  [     DRM] DCB outp 00: 01000300 000088b8
[    1.390645] nouveau  [     DRM] DCB outp 01: 02010310 000088b8
[    1.390648] nouveau  [     DRM] DCB outp 02: 01010312 00000000
[    1.390652] nouveau  [     DRM] DCB outp 03: 02020321 00000303
[    1.390771] nouveau  [     DRM] Loading NV17 power sequencing microcode
[    1.391050] nouveau  [     DRM] Saving VGA fonts
[    1.438608] nouveau  [     DRM] 0xE078: Parsing digital output script table
[    1.489190] nouveau  [     DRM] 0 available performance level(s)
[    1.489197] nouveau  [     DRM] c: core 249MHz memory 265MHz
[    1.492860] nouveau  [     DRM] MM: using M2MF for buffer copies
[    1.492876] nouveau  [     DRM] Setting dpms mode 3 on TV encoder (output 3)
[    1.555055] nouveau  [     DRM] allocated 1920x1080 fb: 0x9000, bo f71f5200
[    1.555164] fbcon: nouveaufb (fb0) is primary device
[    1.584535] nouveau 0000:01:00.0: fb0: nouveaufb frame buffer device
[    1.584541] nouveau 0000:01:00.0: registered panic notifier
[    1.586481] [drm] Initialized nouveau 1.1.1 20120801 for 0000:01:00.0 on minor 0
Comment 1 Ilia Mirkin 2013-10-11 19:36:38 UTC
What version of Mesa are you using? Also, what process owns 0xcccc0000 -- I assume X? The full dmesg would definitely be nice to have. What chipset do you have (lspci -nn wouldn't hurt)? Does setting agpmode=4 or agpmode=0 help at all?

Another idea -- add MESA_EXTENSION_OVERRIDE=-GL_ARB_framebuffer_object to your system environment (make sure that the DM gets it).

FWIW, my personal experience with a NV34 (PCI-E) is that as soon as I open chrome or firefox, it hangs hard; before then, no issues. And that's without any sort of compositor/etc in the fray.
Comment 2 Rostislav Krasny 2013-10-12 16:02:16 UTC
(In reply to comment #1)
> What version of Mesa are you using?

9.2.0. The latest packages of Fedora 19:

[rosti@localhost ~]$ rpm -qa | grep mesa
mesa-libgbm-9.2-1.20130919.fc19.i686
mesa-dri-drivers-9.2-1.20130919.fc19.i686
mesa-libGLU-9.0.0-2.fc19.i686
mesa-libxatracker-9.2-1.20130919.fc19.i686
mesa-libwayland-egl-9.2-1.20130919.fc19.i686
mesa-filesystem-9.2-1.20130919.fc19.i686
mesa-libEGL-9.2-1.20130919.fc19.i686
mesa-libglapi-9.2-1.20130919.fc19.i686
mesa-libGL-9.2-1.20130919.fc19.i686

Following logs might be intersting as well:

Oct 12 17:29:59 localhost.localdomain xinit[275]: kdeinit4: Aborting. $HOME not set!
Oct 12 17:29:59 localhost.localdomain xinit[275]: (__main__.py:364): Gtk-CRITICAL **: gtk_box_reorder_child: assertion `old_link != NULL' failed
Oct 12 17:29:59 localhost.localdomain xinit[275]: kdeinit4: Aborting. $HOME not set!kwin(362): Couldn't start kglobalaccel from kglobalaccel.desktop:  "KLaunc
Oct 12 17:29:59 localhost.localdomain xinit[275]: The name org.kde.klauncher was not provided by any .service files
Oct 12 17:29:59 localhost.localdomain xinit[275]: "
Oct 12 17:30:01 localhost.localdomain xinit[275]: kdeinit4: Aborting. $HOME not set!kwin(362): No ksycoca4 database available!
Oct 12 17:30:01 localhost.localdomain xinit[275]: kwin(362)/kdecore (trader) KServiceTypeTrader::defaultOffers: KServiceTypeTrader: serviceType  "KWin/Script"
Oct 12 17:30:05 localhost.localdomain xinit[275]: [<class 'pyanaconda.ui.gui.spokes.password.PasswordSpoke'>, <class 'pyanaconda.ui.gui.spokes.user.UserSpoke'
Oct 12 17:30:05 localhost.localdomain xinit[275]: OpenGL vendor string:                   nouveau
Oct 12 17:30:05 localhost.localdomain xinit[275]: OpenGL renderer string:                 Gallium 0.4 on NV34
Oct 12 17:30:05 localhost.localdomain xinit[275]: OpenGL version string:                  1.5 Mesa 9.2.0
Oct 12 17:30:05 localhost.localdomain xinit[275]: OpenGL shading language version string: 1.20
Oct 12 17:30:05 localhost.localdomain xinit[275]: Driver:                                 Nouveau
Oct 12 17:30:05 localhost.localdomain xinit[275]: GPU class:                              NV30
Oct 12 17:30:05 localhost.localdomain xinit[275]: OpenGL version:                         1.5
Oct 12 17:30:05 localhost.localdomain xinit[275]: GLSL version:                           1.20
Oct 12 17:30:05 localhost.localdomain xinit[275]: Mesa version:                           9.2
Oct 12 17:30:05 localhost.localdomain xinit[275]: Linux kernel version:                   3.11.3
Oct 12 17:30:05 localhost.localdomain xinit[275]: Direct rendering:                       yes
Oct 12 17:30:05 localhost.localdomain xinit[275]: Requires strict binding:                yes
Oct 12 17:30:05 localhost.localdomain xinit[275]: GLSL shaders:                           limited
Oct 12 17:30:05 localhost.localdomain xinit[275]: Texture NPOT support:                   no
Oct 12 17:30:05 localhost.localdomain xinit[275]: Virtual Machine:                        no
Oct 12 17:30:05 localhost.localdomain xinit[275]: kdeinit4: Aborting. $HOME not set!kwin(362): No ksycoca4 database available!
Oct 12 17:30:05 localhost.localdomain xinit[275]: kwin(362)/kdecore (trader) KServiceTypeTrader::defaultOffers: KServiceTypeTrader: serviceType  "KWin/Effect"
Oct 12 17:30:05 localhost.localdomain systemd[1]: Reloading.
Oct 12 17:30:05 localhost.localdomain LVM: Logical Volume autoactivation enabled.
Oct 12 17:30:05 localhost.localdomain LVM: Activation generator successfully completed.
Oct 12 17:30:06 localhost.localdomain xinit[275]: /bin/xinit: connection to X server lost
Oct 12 17:30:06 localhost.localdomain xinit[275]: waiting for X server to shut down kwin: Fatal IO error: client killed
Oct 12 17:30:06 localhost.localdomain xinit[275]: (EE) QThreadStorage: Thread 0x8ed0d40 exited after QThreadStorage 32 destroyed
Oct 12 17:30:06 localhost.localdomain xinit[275]: QThreadStorage: Thread 0x8ed0d40 exited after QThreadStorage 31 destroyed
Oct 12 17:30:06 localhost.localdomain xinit[275]: Server terminated successfully (0). Closing log file.


> Also, what process owns 0xcccc0000 -- I assume X?

Do you mean the process name written in the "failed to idle channel" error message? It depends on the DE. Currently I've Fedora 19 with KDE DE installed. It's KDE 4.11.2. And following logs are taken from journalctl -a

Oct 12 18:18:18 localhost.localdomain kernel: nouveau E[kwin[2776]] failed to idle channel 0xcccc0000 [kwin[2776]]
Oct 12 18:18:33 localhost.localdomain kernel: nouveau E[kwin[2776]] failed to idle channel 0xcccc0000 [kwin[2776]]
Oct 12 18:18:48 localhost.localdomain kernel: nouveau E[ X[2521]] failed to idle channel 0xcccc0000 [X[2521]]
Oct 12 18:18:48 localhost.localdomain systemd[1]: Started Show Plymouth Reboot Screen.
Oct 12 18:19:03 localhost.localdomain kernel: nouveau E[ X[2521]] failed to idle channel 0xcccc0000 [X[2521]]

KDE was stuck again. I was able only to move the mouse and to switch to console terminals (Ctlr-Aft-Fn). So after some time (5 - 10 minutes) I switched to console and ran 'shutdown -r now'. As you can see the above logs were written during the shutdown. And there are two processes owned the 0xcccc0000: kwin and X. Also there could be the same failure of idling channel but whithout any process name within the square brackets. I had seen that on Arch Linux.

> The full dmesg would definitely be nice to have.

I will attach it.

> What chipset do you have (lspci -nn wouldn't hurt)?

[rosti@localhost ~]$ lspci -nn
00:00.0 Host bridge [0600]: Intel Corporation 82865G/PE/P DRAM Controller/Host-Hub Interface [8086:2570] (rev 02)
00:01.0 PCI bridge [0604]: Intel Corporation 82865G/PE/P AGP Bridge [8086:2571] (rev 02)
00:06.0 System peripheral [0880]: Intel Corporation 82865G/PE/P Processor to I/O Memory Interface [8086:2576] (rev 02)
00:1d.0 USB controller [0c03]: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 [8086:24d2] (rev 02)
00:1d.1 USB controller [0c03]: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 [8086:24d4] (rev 02)
00:1d.2 USB controller [0c03]: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 [8086:24d7] (rev 02)
00:1d.3 USB controller [0c03]: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #4 [8086:24de] (rev 02)
00:1d.7 USB controller [0c03]: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller [8086:24dd] (rev 02)
00:1e.0 PCI bridge [0604]: Intel Corporation 82801 PCI Bridge [8086:244e] (rev c2)
00:1f.0 ISA bridge [0601]: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge [8086:24d0] (rev 02)
00:1f.1 IDE interface [0101]: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller [8086:24db] (rev 02)
00:1f.3 SMBus [0c05]: Intel Corporation 82801EB/ER (ICH5/ICH5R) SMBus Controller [8086:24d3] (rev 02)
00:1f.5 Multimedia audio controller [0401]: Intel Corporation 82801EB/ER (ICH5/ICH5R) AC'97 Audio Controller [8086:24d5] (rev 02)
01:00.0 VGA compatible controller [0300]: NVIDIA Corporation NV34 [GeForce FX 5500] [10de:0326] (rev a1)
02:09.0 Ethernet controller [0200]: Marvell Technology Group Ltd. 88E8001 Gigabit Ethernet Controller [11ab:4320] (rev 13)

> Does setting agpmode=4 or agpmode=0 help at all?

Didn't try yet but I will. Where to set it?

> Another idea -- add MESA_EXTENSION_OVERRIDE=-GL_ARB_framebuffer_object to
> your system environment (make sure that the DM gets it).

Do you mean setting it in the ~/.bashrc or where?

> FWIW, my personal experience with a NV34 (PCI-E) is that as soon as I open
> chrome or firefox, it hangs hard; before then, no issues. And that's without
> any sort of compositor/etc in the fray.

I use Firefox 24.0 right now and it doesn't make any problem with nouveau. Actually I'm in KDE Plasma Workspace (failsafe session). This failsafe mode disables something in KDE that allows me to use KDE without nouveau failure.
Comment 3 Rostislav Krasny 2013-10-12 16:04:09 UTC
Created attachment 87515 [details]
full dmesg

Full dmesg output
Comment 4 Rostislav Krasny 2013-10-12 16:50:31 UTC
Created attachment 87518 [details]
second failure

Adding 'export MESA_EXTENSION_OVERRIDE=-GL_ARB_framebuffer_object' into ~/.bashrc didn't help. KDE became stuck a few seconds after login again. After I switched to console terminal the "failed to idle channel 0xcccc0000" failure happened again. See the aatached log file, taken from 'journalctl -a' output. Mentioned in the log dump will be attached in the next message.
Comment 5 Rostislav Krasny 2013-10-12 17:04:19 UTC
(In reply to comment #4)
> Mentioned in the log dump will be attached in the next message.

The dump archive is too big for attachment (more then 5MB). Tell me if you really need it and then I will find a way to send it.
Comment 6 Ilia Mirkin 2013-10-13 01:28:57 UTC
For setting agpmode:

Add nouveau.agpmode=4 (or =0) on the kernel command line.

For the log:

I have no idea what journalctl -a gives you, but it certainly sounds like it's not what we're looking for. Having a kernel log that includes the various hangs would be super.
Comment 7 Rostislav Krasny 2013-10-13 20:33:09 UTC
(In reply to comment #6)
> Having a kernel log that includes the various hangs would be super.

Where it's located or/and how to enable it?
Comment 8 Ilia Mirkin 2013-10-15 23:33:49 UTC
(In reply to comment #7)
> (In reply to comment #6)
> > Having a kernel log that includes the various hangs would be super.
> 
> Where it's located or/and how to enable it?

Depending on circumstances, you might find it in /var/log/messages, or if you can ssh in while the machine is "hung", just run dmesg, or if you can't, set up netconsole.
Comment 9 Rostislav Krasny 2013-10-18 18:57:06 UTC
(In reply to comment #8)
> Depending on circumstances, you might find it in /var/log/messages, or if
> you can ssh in while the machine is "hung", just run dmesg, or if you can't,
> set up netconsole.

The /var/log/messages contains the same output to that of journalctl. journalctl is the systemd way of working with the system logs.

dmesg also doesn't have much information about this crash. I will attach a dmesg2.txt and a journalctl2.txt files in the next messages.

The nouveau Bugs wiki (http://nouveau.freedesktop.org/wiki/Bugs/) suggests to add log_buf_len=1M into the kernel parameters. But I doubt it would help in my case, would it? Maybe there is some other kernel/mesa parameter that will change the log level to something more informative?
Comment 10 Rostislav Krasny 2013-10-18 18:58:49 UTC
Created attachment 87833 [details]
Full dmesg output after the crash
Comment 11 Rostislav Krasny 2013-10-18 19:00:14 UTC
Created attachment 87834 [details]
journalctl (same to /var/log/messages) chunk, related to the crash
Comment 12 Rostislav Krasny 2013-10-18 19:52:37 UTC
(In reply to comment #6)
> For setting agpmode:
> 
> Add nouveau.agpmode=4 (or =0) on the kernel command line.

I tried both and neither had helped.

Fortunatelly I've found backtraces like following in /var/log/kdm.log-20131014 and in /var/log/kdm.log

(EE) [mi] EQ overflowing.  Additional events will be discarded until existing events are processed.
(EE)
(EE) Backtrace:
(EE) 0: /usr/bin/X (mieqEnqueue+0x21b) [0x81d417b]
(EE) 1: /usr/bin/X (QueuePointerEvents+0x6c) [0x8092c5c]
(EE) 2: /usr/bin/X (xf86PostMotionEventM+0x253) [0x80e06d3]
(EE) 3: /usr/lib/xorg/modules/input/evdev_drv.so (_init+0x31c7) [0xb70d7147]
(EE) 4: /usr/bin/X (DPMSSupported+0xc6) [0x80cfb56]
(EE) 5: /usr/bin/X (xf86SerialModemClearBits+0x19b) [0x80fb15b]
(EE) 6: ? (?+0x19b) [0xb77d259a]
(EE) 7: ? (?+0x19b) [0xb77d25af]
(EE) 8: /lib/libc.so.6 (ioctl+0x19) [0xb73669a9]
(EE) 9: /lib/libdrm.so.2 (drmIoctl+0x40) [0xb758ebd0]
(EE) 10: /lib/libdrm.so.2 (drmCommandWrite+0x3f) [0xb759147f]
(EE) 11: /lib/libdrm_nouveau.so.2 (nouveau_bo_wait+0xa6) [0xb6d38146]
(EE) 12: /usr/lib/xorg/modules/drivers/nouveau_drv.so (_init+0x693a) [0xb6d4df2a]
(EE) 13: /usr/lib/xorg/modules/drivers/nouveau_drv.so (_init+0x6f91) [0xb6d4eb11]
(EE) 14: /usr/bin/X (DRI2SwapBuffers+0x36e) [0x81c48ce]
(EE) 15: /usr/bin/X (DRI2GetParam+0xcc4) [0x81c65e4]
(EE) 16: /usr/bin/X (SendErrorToClient+0x3dd) [0x807a87d]
(EE) 17: /usr/bin/X (_init+0x4659) [0x806c6d9]
(EE) 18: /lib/libc.so.6 (__libc_start_main+0xf3) [0xb7292963]
(EE) 19: /usr/bin/X (_start+0x21) [0x806887a]
(EE)
(EE) [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.
(EE) [mi] mieq is *NOT* the cause.  It is a victim.
(EE) [mi] EQ overflow continuing.  100 events have been dropped.
(EE)
(EE) Backtrace:
(EE) 0: /usr/bin/X (mieqEnqueue+0xe8) [0x81d4048]
(EE) 1: /usr/bin/X (QueuePointerEvents+0x6c) [0x8092c5c]
(EE) 2: /usr/bin/X (xf86PostMotionEventM+0x253) [0x80e06d3]
(EE) 3: /usr/lib/xorg/modules/input/evdev_drv.so (_init+0x31c7) [0xb70d7147]
(EE) 4: /usr/bin/X (DPMSSupported+0xc6) [0x80cfb56]
(EE) 5: /usr/bin/X (xf86SerialModemClearBits+0x19b) [0x80fb15b]
(EE) 6: ? (?+0x19b) [0xb77d259a]
(EE) 7: ? (?+0x19b) [0xb77d25af]
(EE) 8: /lib/libc.so.6 (ioctl+0x19) [0xb73669a9]
(EE) 9: /lib/libdrm.so.2 (drmIoctl+0x40) [0xb758ebd0]
(EE) 10: /lib/libdrm.so.2 (drmCommandWrite+0x3f) [0xb759147f]
(EE) 11: /lib/libdrm_nouveau.so.2 (nouveau_bo_wait+0xa6) [0xb6d38146]
(EE) 12: /usr/lib/xorg/modules/drivers/nouveau_drv.so (_init+0x693a) [0xb6d4df2a]
(EE) 13: /usr/lib/xorg/modules/drivers/nouveau_drv.so (_init+0x6f91) [0xb6d4eb11]
(EE) 14: /usr/bin/X (DRI2SwapBuffers+0x36e) [0x81c48ce]
(EE) 15: /usr/bin/X (DRI2GetParam+0xcc4) [0x81c65e4]
(EE) 16: /usr/bin/X (SendErrorToClient+0x3dd) [0x807a87d]
(EE) 17: /usr/bin/X (_init+0x4659) [0x806c6d9]
(EE) 18: /lib/libc.so.6 (__libc_start_main+0xf3) [0xb7292963]
(EE) 19: /usr/bin/X (_start+0x21) [0x806887a]


As you can see it had crashed during nouveau_bo_wait+0xa6. Does it help to understand what is going on here?
Comment 13 Rostislav Krasny 2013-10-18 19:57:28 UTC
(In reply to comment #12)

After several backtraces like in the previous comment there is the following one:

(EE)
[mi] Increasing EQ size to 1024 to prevent dropped events.
[mi] EQ processing has resumed after 433 dropped events.
[mi] This may be caused my a misbehaving driver monopolizing the server's resources.
(II) AIGLX: Suspending AIGLX clients for VT switch
(EE)
(EE) Backtrace:
(EE) 0: /usr/bin/X (OsLookupColor+0x136) [0x80b6a56]
(EE) 1: ? (?+0x136) [0xb77bd541]
(EE) 2: /lib/libc.so.6 (__GI_memcpy+0x46) [0xb72e34a6]
(EE) 3: ? (?+0x46) [0x9e76e3e]
(EE)
(EE) Segmentation fault at address 0x0
(EE)
Fatal server error:
(EE) Caught signal 11 (Segmentation fault). Server aborting
(EE)
(EE)
Please consult the Fedora Project support
         at http://wiki.x.org
 for help.
(EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
(EE)
(EE) Server terminated with error (1). Closing log file.
Comment 14 Rostislav Krasny 2013-10-23 17:42:06 UTC
Any progress? Anything else that I can do to help?
Comment 15 Ilia Mirkin 2013-11-09 20:43:25 UTC
Please take a look at https://bugs.freedesktop.org/show_bug.cgi?id=69375#c7 and see if doing that helps. I see that, at the very least, your kernel was compiled with gcc 4.8, which likely means libdrm was too.
Comment 16 Rostislav Krasny 2013-12-04 21:04:27 UTC
(In reply to comment #15)
> Please take a look at https://bugs.freedesktop.org/show_bug.cgi?id=69375#c7
> and see if doing that helps. I see that, at the very least, your kernel was
> compiled with gcc 4.8, which likely means libdrm was too.

Installed fresh Fedora 19 with latest RPMs, including libdrm-2.4.47-1.fc19.i686 package, that has the above patch in it. Unfortunatelly this doesn't help:

Dec  4 22:01:12 localhost abrt-hook-ccpp[2105]: Saved core dump of pid 1553 (/usr/bin/Xorg) to /var/tmp/abrt/ccpp-2013-12-04-22:01:11-1553 (21499904 bytes)
Dec  4 22:01:12 localhost abrt-server[2106]: Generating core_backtrace
Dec  4 22:01:12 localhost abrt-server[2106]: Generating backtrace
Dec  4 22:01:15 localhost abrt-server[2106]: journalctl: unrecognized option '--system'
Dec  4 22:01:27 localhost kernel: [  534.189018] nouveau E[ X[1553]] failed to idle channel 0xcccc0000 [X[1553]]
Dec  4 22:01:42 localhost kernel: [  549.189017] nouveau E[ X[1553]] failed to idle channel 0xcccc0000 [X[1553]]
Dec  4 22:01:42 localhost kdm[309]: X server for display :0 terminated unexpectedly
Dec  4 22:01:42 localhost kdm[309]: plymouth should quit after server startup
Dec  4 22:01:42 localhost NetworkManager[372]: <warn> error requesting auth for org.freedesktop.NetworkManager.wifi.share.open: (3) GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Could not get UID of name ':1.56': no such name
Dec  4 22:01:42 localhost NetworkManager[372]: <warn> error requesting auth for org.freedesktop.NetworkManager.wifi.share.protected: (3) GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Could not get UID of name ':1.56': no such name
Dec  4 22:01:42 localhost NetworkManager[372]: <warn> error requesting auth for org.freedesktop.NetworkManager.wifi.share.open: (3) GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: GDBus.Error:org.freedesktop.DBus.Error.NameHasNoOwner: Could not get UID of name ':1.56': no such name
Dec  4 22:01:57 localhost kernel: [  564.388018] nouveau E[kwin[1815]] failed to idle channel 0xcccc0000 [kwin[1815]]
Dec  4 22:02:12 localhost kernel: [  579.388016] nouveau E[kwin[1815]] failed to idle channel 0xcccc0000 [kwin[1815]]
Dec  4 22:02:12 localhost kdm[309]: Quitting Plymouth with transition
Dec  4 22:02:12 localhost kdm[309]: Is Plymouth still running? no


(gdb) bt
#0  0xb777e424 in __kernel_vsyscall ()
#1  0xb7235936 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#2  0xb7237173 in __GI_abort () at abort.c:90
#3  0x080b98ea in OsAbort () at utils.c:1299
#4  0x080d2d97 in ddxGiveUp (error=error@entry=EXIT_ERR_ABORT) at xf86Init.c:1075
#5  0x080d2e3b in AbortDDX (error=error@entry=EXIT_ERR_ABORT) at xf86Init.c:1119
#6  0x080b4eb2 in AbortServer () at log.c:767
#7  0x080b598f in FatalError (f=f@entry=0x81f83f4 "Caught signal %d (%s). Server aborting\n") at log.c:908
#8  0x080b6c76 in OsSigHandler (signo=11, sip=0xbfa247cc, unused=0xbfa2484c) at osinit.c:147
#9  <signal handler called>
#10 __memcpy_ia32 () at ../sysdeps/i386/i686/memcpy.S:74
#11 0xb42f4008 in ?? ()
(gdb)
Comment 17 Rostislav Krasny 2013-12-04 21:35:28 UTC
...and /var/log/kdm.log has the similar backtraces like it was in a past:


(EE) Backtrace:
(EE) 0: /usr/bin/X (mieqEnqueue+0x21b) [0x81d4adb]
(EE) 1: /usr/bin/X (QueuePointerEvents+0x6c) [0x8092f2c]
(EE) 2: /usr/bin/X (xf86PostMotionEventM+0x253) [0x80e0a93]
(EE) 3: /usr/lib/xorg/modules/input/evdev_drv.so (_init+0x31c7) [0xb70a5147]
(EE) 4: /usr/bin/X (DPMSSupported+0xc6) [0x80cff06]
(EE) 5: /usr/bin/X (xf86SerialModemClearBits+0x19b) [0x80fb54b]
(EE) 6: ? (?+0x19b) [0xb77bf59a]
(EE) 7: ? (?+0x19b) [0xb77bf5af]
(EE) 8: /lib/libc.so.6 (ioctl+0x19) [0xb7335989]
(EE) 9: /lib/libdrm.so.2 (drmIoctl+0x40) [0xb755dbf0]
(EE) 10: /lib/libdrm.so.2 (drmCommandWrite+0x3f) [0xb75604ff]
(EE) 11: /lib/libdrm_nouveau.so.2 (nouveau_bo_wait+0xa6) [0xb6d09146]
(EE) 12: /usr/lib/xorg/modules/drivers/nouveau_drv.so (_init+0x2019) [0xb6d15cf9]
(EE) 13: /usr/lib/xorg/modules/libexa.so (exaMoveOutPixmap+0x6e2e) [0xb6cd48ae]
(EE) 14: /usr/bin/X (miFillUniqueSpanGroup+0x1adf) [0x81e65bf]
(EE) 15: /usr/bin/X (xf86I2CGetScreenBuses+0x1dfb) [0x81251ab]
(EE) 16: /usr/bin/X (dixDestroyPixmap+0x1ba9) [0x8078f79]
(EE) 17: /usr/bin/X (SendErrorToClient+0x3dd) [0x807aa5d]
(EE) 18: /usr/bin/X (_init+0x4665) [0x806c8c5]
(EE) 19: /lib/libc.so.6 (__libc_start_main+0xf3) [0xb7261963]
(EE) 20: /usr/bin/X (_start+0x21) [0x8068a5a]
(EE) 
(EE) [mi] These backtraces from mieqEnqueue may point to a culprit higher up the stack.
(EE) [mi] mieq is *NOT* the cause.  It is a victim.
(EE) [mi] EQ overflow continuing.  100 events have been dropped.
(EE) 
(EE) Backtrace:
(EE) 0: /usr/bin/X (mieqEnqueue+0xe8) [0x81d49a8]
(EE) 1: /usr/bin/X (QueuePointerEvents+0x6c) [0x8092f2c]
(EE) 2: /usr/bin/X (xf86PostMotionEventM+0x253) [0x80e0a93]
(EE) 3: /usr/lib/xorg/modules/input/evdev_drv.so (_init+0x31c7) [0xb70a5147]
(EE) 4: /usr/bin/X (DPMSSupported+0xc6) [0x80cff06]
(EE) 5: /usr/bin/X (xf86SerialModemClearBits+0x19b) [0x80fb54b]
(EE) 6: ? (?+0x19b) [0xb77bf59a]
(EE) 7: ? (?+0x19b) [0xb77bf5af]
(EE) 8: /lib/libc.so.6 (ioctl+0x19) [0xb7335989]
(EE) 9: /lib/libdrm.so.2 (drmIoctl+0x40) [0xb755dbf0]
(EE) 10: /lib/libdrm.so.2 (drmCommandWrite+0x3f) [0xb75604ff]
(EE) 11: /lib/libdrm_nouveau.so.2 (nouveau_bo_wait+0xa6) [0xb6d09146]
(EE) 12: /usr/lib/xorg/modules/drivers/nouveau_drv.so (_init+0x2019) [0xb6d15cf9]
(EE) 13: /usr/lib/xorg/modules/libexa.so (exaMoveOutPixmap+0x6e2e) [0xb6cd48ae]
(EE) 14: /usr/bin/X (miFillUniqueSpanGroup+0x1adf) [0x81e65bf]
(EE) 15: /usr/bin/X (xf86I2CGetScreenBuses+0x1dfb) [0x81251ab]
(EE) 16: /usr/bin/X (dixDestroyPixmap+0x1ba9) [0x8078f79]
(EE) 17: /usr/bin/X (SendErrorToClient+0x3dd) [0x807aa5d]
(EE) 18: /usr/bin/X (_init+0x4665) [0x806c8c5]
(EE) 19: /lib/libc.so.6 (__libc_start_main+0xf3) [0xb7261963]
(EE) 20: /usr/bin/X (_start+0x21) [0x8068a5a]
(EE) 
(EE) [mi] EQ overflow continuing.  200 events have been dropped.

and so on...
Comment 18 Rostislav Krasny 2014-06-14 15:10:58 UTC
Any progress? More than six months have passed since the last message in this bug report. In that time NVidia has published some of their specifications. Did it help you to understand how NV34 works and how to fix this critical bug?
Comment 19 Linas Vepstas 2014-09-02 19:47:57 UTC
FWIW, I am hitting a similar or the same bug.  Let me keep the report brief:

-- In my case, its an NV44A
-- OS is Ubuntu 12.04 with this weeks latest, brand-new kernel: 3.16.1
-- X seems to hang/crash once/twice right after boot, and then the following kern messages start showing up:

[  168.871453] nouveau  [     DRM] 0xD2C1: Parsing digital output script table
[  168.934091] nouveau E[   PFIFO][0000:01:06.0] DMA_PUSHER - ch 1 [Xorg[2278]] get 0x00010000 put 0x00010090 state 0x80000000 (err: INVALID_CMD) push 0x00000000
[  200.472639] nouveau E[   PFIFO][0000:01:06.0] DMA_PUSHER - ch 2 [Xorg[2278]] get 0x01033000 put 0x01033088 state 0x80000000 (err: INVALID_CMD) push 0x00000000
[  215.472015] nouveau E[Xorg[2278]] failed to idle channel 0xcccc0001 [Xorg[2278]]
[  215.472066] nouveau E[   PFIFO][0000:01:06.0] DMA_PUSHER - ch 2 [Xorg[2278]] get 0x01033088 put 0x01033090 state 0x80000000 (err: INVALID_CMD) push 0x00000000
[  230.472014] nouveau E[Xorg[2278]] failed to idle channel 0xcccc0001 [Xorg[2278]]
[  245.472044] nouveau E[Xorg[2278]] failed to idle channel 0xcccc0000 [Xorg[2278]]
[  260.472018] nouveau E[Xorg[2278]] failed to idle channel 0xcccc0000 [Xorg[2278]]
[  261.637808] nouveau  [     DRM] 0xC633: Parsing digital output script table
[  261.787097] nouveau  [     DRM] 0xD2C1: Parsing digital output script table
[  261.849926] nouveau E[   PFIFO][0000:01:06.0] DMA_PUSHER - ch 1 [Xorg[2373]] 

-- Note above is almost 3 minutes after boot!
-- Note the 'fielded to idle channel' are exactly 15 seconds apart.
-- Xorg is still crashing/hanging at this point.
 Basically, it starts, the mouse pointer works fine, the lightdm tries to paint the login dialog, the painting never completes (just a square of grey), some timer kicks in, and then things repeat again.  Typically one Xorg hang + autorestart every 3 minutes, repeating 8 to 12 times (about 1/2 hour) when one of them finally, magically "just works", after which I can log in and use the machine as normal.  (OK, I have not yet tried watching youtube or doing 3D yet...)

During later attempts, I get showers of the PFIFO DMA_PUSHER errors,  then 7 minutes in, I start getting this for the first time:

[  431.120645] nouveau E[   PFIFO][0000:01:06.0] DMA_PUSHER - ch 1 [Xorg[2455]] get 0xbc882d90 put 0x00010130 state 0xc0000000 (err: MEM_FAULT) push 0x00000000
[  431.120704] nouveau E[   PFIFO][0000:01:06.0] CACHE_ERROR - ch 1 [Xorg[2455]] subc 7 mthd 0x193c data 0x3e998b7a
[  431.120727] nouveau E[   PFIFO][0000:01:06.0] CACHE_ERROR - ch 1 [Xorg[2455]] subc 7 mthd 0x1940 data 0x5e07b578
[  431.120751] nouveau E[   PFIFO][0000:01:06.0] CACHE_ERROR - ch 1 [Xorg[2455]] subc 7 mthd 0x1944 data 0x3df04d6e
[  431.120995] nouveau E[   PFIFO][0000:01:06.0] CACHE_ERROR - ch 1 [Xorg[2455]] subc 7 mthd 0x1948 data 0xd4a617e5
[  431.121017] nouveau E[   PFIFO][0000:01:06.0] CACHE_ERROR - ch 1 [Xorg[2455]] subc 7 mthd 0x194c data 0x92b307e9
[  431.121039] nouveau E[   PFIFO][0000:01:06.0] CACHE_ERROR - ch 1 [Xorg[2455]] subc 7 mthd 0x1950 data 0x05d62b60
[  431.121061] nouveau E[   PFIFO][0000:01:06.0] CACHE_ERROR - ch 1 [Xorg[2455]] subc 7 mthd 0x1954 data 0x42e9cdfa

which goes on a good while. ... then the cycle repeats, as described above.  Never quite exactly the same way twice, but similar each time.
Comment 20 Ilia Mirkin 2014-09-02 19:54:21 UTC
(In reply to comment #19)
> FWIW, I am hitting a similar or the same bug.  Let me keep the report brief:
> 
> -- In my case, its an NV44A

That's the AGP version. Can you confirm that setting agpmode to 0 doesn't help? How about vram_pushbuf=1 (although setting agpmode to 0 should arrange for pretty much the same thing)
Comment 21 Linas Vepstas 2014-09-02 20:03:55 UTC
Also, I'm on x86-64.
Also, I have TWO NV graphics cards in this system.  The problem occurs only for the one plugged into the slower PCI channel.

Also:
[    0.256789] pci 0000:02:00.0: Boot video device
[    0.256791] PCI: CLS 32 bytes, default 64
[    0.256980] PCI-DMA: Disabling AGP.
[    0.257075] PCI-DMA: aperture base @ b4000000 size 65536 KB
[    0.257079] PCI-DMA: using GART IOMMU.
[    0.257083] PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture

so the device having problems is NOT the boot device. strange, because this is the second device detected by nouveau...

Also: first nouveu error is during boot:

[    5.084189] nouveau E[   PFIFO][0000:01:06.0] DMA_PUSHER - ch 0 [DRM] get 0x00000000 put 0x00000188 state 0x80000000 (err: INVALID_CMD) push 0x00000000

and occurs before the second card is detected. No further problems until about 3 minutes later, after the first attempt at Xorg has failed, and the second attempt to start X gets going.
Comment 22 Linas Vepstas 2014-09-02 20:44:31 UTC
I don't think I have AGP .. trying to figure this out.  The north/southbridge is NVIDIA Corporation MCP55. 

Here is the PCIe bridge and the video cards:

lspci -vnn

00:0f.0 PCI bridge [0604]: NVIDIA Corporation MCP55 PCI Express bridge [10de:0377] (rev a3) (prog-if 00 [Normal decode])
	Flags: bus master, fast devsel, latency 0
	Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
	Memory behind bridge: f7000000-f9ffffff
	Prefetchable memory behind bridge: 00000000e0000000-00000000efffffff
	Capabilities: [40] Subsystem: NVIDIA Corporation Device [10de:0000]
	Capabilities: [48] Power Management version 2
	Capabilities: [50] MSI: Enable+ Count=1/2 Maskable- 64bit+
	Capabilities: [60] HyperTransport: MSI Mapping Enable- Fixed-
	Capabilities: [80] Express Root Port (Slot+), MSI 00
	Capabilities: [100] Virtual Channel
	Kernel driver in use: pcieport

01:06.0 VGA compatible controller [0300]: NVIDIA Corporation NV44A [GeForce 6200] [10de:0221] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: eVga.com. Corp. Device [3842:b402]
	Flags: bus master, 66MHz, medium devsel, latency 32, IRQ 16
	Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
	Memory at c0000000 (32-bit, prefetchable) [size=512M]
	Memory at fa000000 (32-bit, non-prefetchable) [size=16M]
	[virtual] Expansion ROM at fc000000 [disabled] [size=128K]
	Capabilities: [60] Power Management version 2
	Kernel driver in use: nouveau

02:00.0 VGA compatible controller [0300]: NVIDIA Corporation G72 [GeForce 7300 LE] [10de:01d1] (rev a1) (prog-if 00 [VGA controller])
	Subsystem: Dell Device [1028:0405]
	Flags: bus master, fast devsel, latency 0, IRQ 42
	Memory at f7000000 (32-bit, non-prefetchable) [size=16M]
	Memory at e0000000 (64-bit, prefetchable) [size=256M]
	Memory at f8000000 (64-bit, non-prefetchable) [size=16M]
	Expansion ROM at f9fe0000 [disabled] [size=128K]
	Capabilities: [60] Power Management version 2
	Capabilities: [68] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [78] Express Endpoint, MSI 00
	Capabilities: [100] Virtual Channel
	Capabilities: [128] Power Budgeting <?>
	Kernel driver in use: nouveau

The only dmesg lines that mention AGP, agp or gart are:

[    0.000000] AGP: No AGP bridge found
[    0.000000] AGP: Node 0: aperture [bus addr 0xb4000000-0xb5ffffff]
(32MB)
[    0.000000] Aperture pointing to e820 RAM. Ignoring.
[    0.000000] AGP: Your BIOS doesn't leave a aperture memory hole
[    0.000000] AGP: Please enable the IOMMU option in the BIOS setup
[    0.000000] AGP: This costs you 64MB of RAM
[    0.000000] AGP: Mapping aperture over RAM [mem 0xb4000000-0xb7ffffff] (65536KB)

[    0.256789] pci 0000:02:00.0: Boot video device
[    0.256791] PCI: CLS 32 bytes, default 64
[    0.256980] PCI-DMA: Disabling AGP.
[    0.257075] PCI-DMA: aperture base @ b4000000 size 65536 KB
[    0.257079] PCI-DMA: using GART IOMMU.
[    0.257083] PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture

[    0.284288] Linux agpgart interface v0.103

Should I try vram_pushbuf=1 anyway ?
Comment 23 Ilia Mirkin 2014-09-02 20:47:01 UTC
(In reply to comment #22)
> Should I try vram_pushbuf=1 anyway ?

Yep, you should. So it's a PCI version of the NV4A card? What's the other card? BTW, instead of posting partial snippets, mind just providing the full dmesg? That would save a lot of round trips.
Comment 24 Linas Vepstas 2014-09-03 04:29:50 UTC
Wow, vram_pushbuf=1 worked like a charm!  Please send patch upstream and/or change docs now!

Explored every nook and cranny of Bios for the nth time, for n>10, and found no mention of iommu or agp  or any iommu-sounding-like thing.

Will post dmesg and lspci for the successful boot, shortly.  I'll also try to pull syslog for the previously-failing boot.
Comment 25 Linas Vepstas 2014-09-03 04:39:23 UTC
Created attachment 105653 [details]
syslog showing nouveau +X struggling to come up, lots of errors
Comment 26 Linas Vepstas 2014-09-03 04:40:46 UTC
Created attachment 105654 [details]
lspci -vnn for above (and for successful dmesg below)
Comment 27 Linas Vepstas 2014-09-03 04:41:39 UTC
Created attachment 105655 [details]
dmesg for successful boot, using vram_pushbuf=1
Comment 28 Linas Vepstas 2014-09-03 04:54:20 UTC
BTW, in the 'successful dmesg' post, I just noticed that the dmesg ends with a compiz+nouveau crash. I don't care, as I do not usually run compiz; I was so excited that the thing booted easily that I completey forgot to log into my usual session, and started a compiz/ubuntu session by mistake instead. However, the system is up-to-date ubuntu 12.04 so it really should not be crashing ...
Comment 29 Linas Vepstas 2014-09-05 16:42:13 UTC
Followup: for me, the vram_pushbuf=1 fixed everything. After 2 days and a dozen youtubes, (but no 3D or fancy effects) working great, no error messages in syslog or Xorg.   Perhaps my issue was very different than the original report -- Rotislav's bug.
Comment 30 Sergii Stepanenko 2014-09-12 13:37:14 UTC
Hello all!
I have same problem on my system with 3 old PCIe NV4B cards. Tested with kernel 3.16.2 and Mesa 10.2.7, with kernel 3.14.11 and Mesa 10.2.2.

After boot I have many errors in dmesg
nouveau E[   PFIFO][0000:03:00.0] CACHE_ERROR - ch 1 [X[1450]] subc 5 mthd 0x019c data 0x80000010
where "ch" and "subc" are different, but always with 2nd or 3rd card.
Everything works well. Then I run any application that uses the GL (for example glxgears). After an interruption of the application system freezes. Then I see few
nouveau E[glxgears[1467]] failed to idle channel 0xcccc0000 [glxgears[1467]] in dmesg.
I have tried with nouveau.vram_pushbuf=1, but have same errors.

Full dmesg log: https://drive.google.com/file/d/0B6qNYzhc-QU-eHo5aDNhOWlxc00
Full Xorg.log:  https://drive.google.com/file/d/0B6qNYzhc-QU-ZWdqdFRCNjlkeWs
Comment 31 Linas Vepstas 2014-09-12 15:37:15 UTC
Sergei, at the boot prompt (grub) I'm using vram_pushbuf=1 and NOT nouveau.vram_pushbuf=1.  I have not tried using /etc/sysctl.conf although I guess that should work, and I guess the entry nouveau.vram_pushbuf=1 might be right.
Comment 32 Sergii Stepanenko 2014-09-15 07:31:43 UTC
Linas, I've tried 

vram_pushbuf=1
nouveau.vram_pushbuf=1
nouveau.vram_pushbuf=1 nouveau.agpmode=0
nouveau.agpmode=0
nouveau.vram_notify=1
nouveau.nofbaccel=1

But always have same errors
Comment 33 Linas Vepstas 2014-10-31 20:50:26 UTC
FYI,

New kernel, new CPU, same old graphics cards:  if vram_pushbuf=1 not specified on boot line, then kernel hangs about 4 seconds in (near first modesetting); I can't capture the dmesg ...

This is with this-weeks kernel on mint-qiana-17  Annotated dmesg below; I can provide the full dmesg if you really want it ... 


[    0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-3.18.0-rc2-linas root=UUID=18a92961-a17b-4205-9038-a6eb91849706 ro nouveau.vram_pushbuf=1 nomdmonddf nomdmonisw
...

[    0.000000] AGP: Checking aperture...
[    0.000000] AGP: No AGP bridge found
[    0.000000] AGP: Node 0: aperture [bus addr 0x00000000-0x01ffffff] (32MB)
[    0.000000] AGP: Your BIOS doesn't leave a aperture memory hole
[    0.000000] AGP: Please enable the IOMMU option in the BIOS setup
[    0.000000] AGP: This costs you 64MB of RAM
[    0.000000] AGP: Mapping aperture over RAM [mem 0x94000000-0x97ffffff] (65536KB)

strange to see above, as I'm pretty sure I have the IOMMU turned on ...
I'm pretty sure the machine does not have any agp slots, instead, there are two PCIe x16 slots ... 
...
[    0.000000] DMI: ASUSTeK Computer INC. KGP(M)E-D16/KGP(M)E-D16, BIOS 3201    02/19/2013
...
[    0.648524] pci 0000:00:14.4: can't claim BAR 8 [mem 0xf5800000-0xf8ffffff]: address conflict with amd_iommu [mem 0xf6000000-0xf6003fff]

Note that one of the two graphics cards is a PCI card NOT a PCIe card!!! I mention this because of the above BAR road-bump

00:14.4 PCI bridge: Advanced Micro Devices, Inc. [AMD/ATI] SBx00 PCI to PCI Bridge

[    1.947594] AMD-Vi: Found IOMMU at 0000:00:00.2 cap 0x40
[    1.947599] AMD-Vi: Interrupt remapping enabled
[    1.947617] pci 0000:00:00.2: irq 25 for MSI/MSI-X
[    1.958894] AMD-Vi: Lazy IO/TLB flushing enabled
[    2.042066] PCI-DMA: Using software bounce buffering for IO (SWIOTLB)

[    4.009084] fbcon: nouveaufb (fb1) is primary device
[    4.009086] fbcon: Remapping primary device, fb1, to tty 1-63
[    4.044782] Switched to clocksource tsc
[    4.050013] nouveau  [     DRM] 0xC633: Parsing digital output script table
[    4.161782] nouveau E[     DRM] GPU lockup - switching to software fbcon


I have two DVI cables, one VGA cable.  The edid to the VGA fails ... it didn't use to ... 

[   16.869739] [drm:drm_edid_block_valid] *ERROR* EDID checksum is invalid, remainder is 178
[   16.869747] Raw EDID:
[   16.869749]          00 ff ff ff ff ff ff 00 04 72 a6 01 12 12 12 12
[   16.869751]          15 15 01 03 68 27 ff ff ff ff ff ff ff ff ff ff
etc. 
 it tries multiple times, gets different results ... 

Xorg.0.log shows no startup errors, but the framebuffer shows mostly corruption.

kill -9 of X sends X into 100% cpu usage.  While it does that, this gets logged:
Oct 31 15:36:35 fanny kernel: [ 2449.191822] nouveau E[Xorg[2290]] failed to idle channel 0xcccc0002 [Xorg[2290]]
Oct 31 15:36:50 fanny kernel: [ 2464.194418] nouveau E[Xorg[2290]] failed to idle channel 0xcccc0002 [Xorg[2290]]
Oct 31 15:37:05 fanny kernel: [ 2479.197017] nouveau E[Xorg[2290]] failed to idle channel 0xcccc0001 [Xorg[2290]]
Oct 31 15:37:20 fanny kernel: [ 2494.199611] nouveau E[Xorg[2290]] failed to idle channel 0xcccc0001 [Xorg[2290]]
Oct 31 15:37:35 fanny kernel: [ 2509.202213] nouveau E[Xorg[2290]] failed to idle channel 0xcccc0000 [Xorg[2290]]
Oct 31 15:37:50 fanny kernel: [ 2524.204809] nouveau E[Xorg[2290]] failed to idle channel 0xcccc0000 [Xorg[2290]]


After this, starting X a second time does not generate any video-out, so monitors stay blanked. Nothing at all printed in /var/log/syslog. However kern.log shows:

Oct 31 15:41:47 fanny kernel: [ 2761.045902] INFO: task kworker/10:1:135 blocked for more than 120 seconds.
Oct 31 15:41:47 fanny kernel: [ 2761.045912]       Not tainted 3.18.0-rc2-linas #5
Oct 31 15:41:47 fanny kernel: [ 2761.045915] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 31 15:41:47 fanny kernel: [ 2761.045920] kworker/10:1    D ffff88085dd53040     0   135      2 0x00000000
Oct 31 15:41:47 fanny kernel: [ 2761.045959] Workqueue: events output_poll_execute [drm_kms_helper]
Oct 31 15:41:47 fanny kernel: [ 2761.045964]  ffff88082ea0fc28 0000000000000002 ffff88082e9e1910 ffff88082ea0ffd8
Oct 31 15:41:47 fanny kernel: [ 2761.045971]  0000000000013040 0000000000013040 ffff880831f88000 ffff88082e9e1910
Oct 31 15:41:47 fanny kernel: [ 2761.045978]  ffff88082ea0fc38 ffff8808315a7898 ffff8808315a7946 ffff8808315a79c8
Oct 31 15:41:47 fanny kernel: [ 2761.045985] Call Trace:
Oct 31 15:41:47 fanny kernel: [ 2761.046000]  [<ffffffff81592cf9>] schedule+0x29/0x70
Oct 31 15:41:47 fanny kernel: [ 2761.046010]  [<ffffffff813cd64d>] rpm_resume+0x22d/0x620
Oct 31 15:41:47 fanny kernel: [ 2761.046020]  [<ffffffff81080770>] ? prepare_to_wait_event+0x110/0x110
Oct 31 15:41:47 fanny kernel: [ 2761.046028]  [<ffffffff813cda8d>] __pm_runtime_resume+0x4d/0x70
Oct 31 15:41:47 fanny kernel: [ 2761.046035]  [<ffffffff8105a4f6>] ? __queue_delayed_work+0x166/0x1a0
Oct 31 15:41:47 fanny kernel: [ 2761.046106]  [<ffffffffa040c2b6>] nouveau_connector_detect+0x66/0x3d0 [nouveau]
Oct 31 15:41:47 fanny kernel: [ 2761.046124]  [<ffffffffa011c608>] output_poll_execute+0xb8/0x180 [drm_kms_helper]
Oct 31 15:41:47 fanny kernel: [ 2761.046132]  [<ffffffff8105ab06>] process_one_work+0x146/0x400
Oct 31 15:41:47 fanny kernel: [ 2761.046139]  [<ffffffff8105aee0>] worker_thread+0x120/0x490
Oct 31 15:41:47 fanny kernel: [ 2761.046146]  [<ffffffff8105adc0>] ? process_one_work+0x400/0x400
Oct 31 15:41:47 fanny kernel: [ 2761.046153]  [<ffffffff8105fd82>] kthread+0xd2/0xf0
Oct 31 15:41:47 fanny kernel: [ 2761.046159]  [<ffffffff8105fcb0>] ? kthread_create_on_node+0x1a0/0x1a0
Oct 31 15:41:47 fanny kernel: [ 2761.046166]  [<ffffffff8159722c>] ret_from_fork+0x7c/0xb0
Oct 31 15:41:47 fanny kernel: [ 2761.046172]  [<ffffffff8105fcb0>] ? kthread_create_on_node+0x1a0/0x1a0
Oct 31 15:41:47 fanny kernel: [ 2761.046179] INFO: task kworker/3:1:175 blocked for more than 120 seconds.
Oct 31 15:41:47 fanny kernel: [ 2761.046183]       Not tainted 3.18.0-rc2-linas #5
Oct 31 15:41:47 fanny kernel: [ 2761.046186] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 31 15:41:47 fanny kernel: [ 2761.046189] kworker/3:1     D ffff88085dc73040     0   175      2 0x00000000
Oct 31 15:41:47 fanny kernel: [ 2761.046197] Workqueue: pm pm_runtime_work
Oct 31 15:41:47 fanny kernel: [ 2761.046201]  ffff88082e593bb8 0000000000000002 ffff88082ebbcb30 ffff88082e593fd8
Oct 31 15:41:47 fanny kernel: [ 2761.046207]  0000000000013040 0000000000013040 ffff880831f4cb30 ffff88082ebbcb30
Oct 31 15:41:47 fanny kernel: [ 2761.046214]  ffff88082e593bc8 ffff8808315a7898 ffff8808315a7946 ffff8808315a79c8
Oct 31 15:41:47 fanny kernel: [ 2761.046221] Call Trace:
Oct 31 15:41:47 fanny kernel: [ 2761.046229]  [<ffffffff81592cf9>] schedule+0x29/0x70
Oct 31 15:41:47 fanny kernel: [ 2761.046236]  [<ffffffff813cd64d>] rpm_resume+0x22d/0x620
Oct 31 15:41:47 fanny kernel: [ 2761.046243]  [<ffffffff81080770>] ? prepare_to_wait_event+0x110/0x110
Oct 31 15:41:47 fanny kernel: [ 2761.046251]  [<ffffffff813cdd5c>] pm_runtime_forbid+0x4c/0x60
Oct 31 15:41:47 fanny kernel: [ 2761.046310]  [<ffffffffa03f6ba8>] nouveau_pmops_runtime_suspend+0xc8/0xf0 [nouveau]
Oct 31 15:41:47 fanny kernel: [ 2761.046320]  [<ffffffff8131a8ba>] pci_pm_runtime_suspend+0x6a/0x120
Oct 31 15:41:47 fanny kernel: [ 2761.046326]  [<ffffffff8131a850>] ? pci_pm_runtime_resume+0x100/0x100
Oct 31 15:41:47 fanny kernel: [ 2761.046332]  [<ffffffff813cc4bd>] __rpm_callback+0x3d/0x90
Oct 31 15:41:47 fanny kernel: [ 2761.046338]  [<ffffffff81059fd1>] ? try_to_grab_pending+0xb1/0x150
Oct 31 15:41:47 fanny kernel: [ 2761.046345]  [<ffffffff813cc53f>] rpm_callback+0x2f/0x90
Oct 31 15:41:47 fanny kernel: [ 2761.046352]  [<ffffffff813cca61>] rpm_suspend+0x111/0x640
Oct 31 15:41:47 fanny kernel: [ 2761.046359]  [<ffffffff8106b251>] ? get_parent_ip+0x11/0x50
Oct 31 15:41:47 fanny kernel: [ 2761.046365]  [<ffffffff8106b2e5>] ? preempt_count_add+0x55/0xb0
Oct 31 15:41:47 fanny kernel: [ 2761.046370]  [<ffffffff813ce3f2>] pm_runtime_work+0x82/0xb0
Oct 31 15:41:47 fanny kernel: [ 2761.046377]  [<ffffffff8105ab06>] process_one_work+0x146/0x400
Oct 31 15:41:47 fanny kernel: [ 2761.046384]  [<ffffffff8105aee0>] worker_thread+0x120/0x490
Oct 31 15:41:47 fanny kernel: [ 2761.046391]  [<ffffffff8105adc0>] ? process_one_work+0x400/0x400
Oct 31 15:41:47 fanny kernel: [ 2761.046396]  [<ffffffff8105fd82>] kthread+0xd2/0xf0
Oct 31 15:41:47 fanny kernel: [ 2761.046403]  [<ffffffff8105fcb0>] ? kthread_create_on_node+0x1a0/0x1a0
Oct 31 15:41:47 fanny kernel: [ 2761.046408]  [<ffffffff8159722c>] ret_from_fork+0x7c/0xb0
Oct 31 15:41:47 fanny kernel: [ 2761.046414]  [<ffffffff8105fcb0>] ? kthread_create_on_node+0x1a0/0x1a0



FWIW: my kernel has CONFIG_PM_CORE enabled, but much/most of the rest of ACPI is not enabled (its not a laptop, and never gets turned off or rebooted).

Don't know if any of my rambling helped, but I conclude:

-- on latest rc kernel, with three-head, dual old graphics cards, one of which is PCI not PCIe, that nouveau has multiple possibly unrelated issues ...
Comment 34 Linas Vepstas 2014-10-31 22:17:34 UTC
Please note: above trouble with today's rc kernel 3.18-rc2 is a *regression* from  the stock kernel that comes w/ linux mint-qiana-17:  Linux version 3.13.0-39-generic (buildd@toyol) (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1) ) #66-Ubuntu SMP Tue Oct 28 13:30:27 UTC 2014

The stock kernel seems to mostly work fine-ish, for both 3D (i.e. blender) and youtube.  Halting and restarting X does ggenerate the 'failed to idle' messages .. but since I never restart X, I can live w/it.

BTW: noveau 2d rendering rocks, much faster than the nvidia drivers, which are noticeably laggy and slow for exactly the same hw & os setup.
Comment 35 Ilia Mirkin 2017-03-11 01:44:36 UTC
Huh. Well, I totally forgot about this bug. And due to wanting to cram ever-more GPUs into my box, ended up getting a PCI NV4x so that I could keep using 2x PCIe slots for "better" GPUs. Well, this turned out to be a NV44A with a PCI bridge, and I'm getting the same issues.

FTR, I have:

# lspci -nn -d ::300
02:00.0 VGA compatible controller [0300]: NVIDIA Corporation GF108 [GeForce GT 620] [10de:0f01] (rev a1)
04:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107 [GeForce GTX 745] [10de:1382] (rev a2)
09:00.0 VGA compatible controller [0300]: NVIDIA Corporation NV44A [GeForce 6200] [10de:0221] (rev a1)
09:01.0 VGA compatible controller [0300]: NVIDIA Corporation NV34 [GeForce FX 5200] [10de:0322] (rev a1)

And only the NV44A is giving me trouble. The NV34 works like a champ, as do the newer GPUs. On boot, with a 4.8.5 kernel, I got:

[    8.114998] [drm] Initialized nouveau 1.3.1 20120801 for 0000:04:00.0 on minor 1
[    8.115015] nouveau 0000:09:00.0: enabling device (0000 -> 0002)
[    8.115152] nouveau 0000:09:00.0: NVIDIA NV44A (04a100a1)
[    8.302416] nouveau 0000:09:00.0: bios: version 05.44.a2.10.61
[    8.344251] nouveau 0000:09:00.0: fb: 256 MiB DDR2
[    8.470618] nouveau 0000:09:00.0: DRM: VRAM: 252 MiB
[    8.470619] nouveau 0000:09:00.0: DRM: GART: 512 MiB
[    8.470621] nouveau 0000:09:00.0: DRM: TMDS table version 1.1
[    8.470622] nouveau 0000:09:00.0: DRM: DCB version 3.0
[    8.470624] nouveau 0000:09:00.0: DRM: DCB outp 00: 01000300 00000028
[    8.470625] nouveau 0000:09:00.0: DRM: DCB outp 01: 02011310 00000028
[    8.470627] nouveau 0000:09:00.0: DRM: DCB outp 02: 01011312 00000000
[    8.470628] nouveau 0000:09:00.0: DRM: DCB outp 03: 020223f1 00c0c030
[    8.470629] nouveau 0000:09:00.0: DRM: DCB conn 00: 0000
[    8.470630] nouveau 0000:09:00.0: DRM: DCB conn 01: 2130
[    8.470632] nouveau 0000:09:00.0: DRM: DCB conn 02: 0210
[    8.470633] nouveau 0000:09:00.0: DRM: DCB conn 03: 0211
[    8.470634] nouveau 0000:09:00.0: DRM: DCB conn 04: 0213
[    8.470647] nouveau 0000:09:00.0: DRM: Adaptor not initialised, running VBIOS init tables.
[    8.470773] nouveau 0000:09:00.0: DRM: Saving VGA fonts
[    8.577099] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    8.577101] [drm] Driver supports precise vblank timestamp query.
[    8.582396] nouveau 0000:09:00.0: DRM: MM: using CPU for buffer copies
[    8.582407] nouveau 0000:09:00.0: DRM: Setting dpms mode 3 on TV encoder (output 3)
[    8.632789] nouveau 0000:09:00.0: DRM: Load detected on output B
[    8.642638] nouveau 0000:09:00.0: DRM: allocated 720x576 fb: 0x9000, bo ffff8f5961eda000
[    8.642701] nouveau 0000:09:00.0: fifo: DMA_PUSHER - ch 0 [DRM] get 00000000 put 00000178 state 80000000 (err: INVALID_CMD) push 00000010
[    8.642794] nouveau 0000:09:00.0: fb2: nouveaufb frame buffer device

and when trying to start X on it, I got plenty more errors:

[  441.868341] nouveau 0000:09:00.0: DRM: Load detected on output B
[  441.920374] nouveau 0000:09:00.0: DRM: Load detected on output B
[  441.965447] nouveau 0000:09:00.0: fifo: DMA_PUSHER - ch 2 [X[2962]] get 00049000 put 00049090 state 80000000 (err: INVALID_CMD) push 00000010
[  441.966367] nouveau 0000:09:00.0: fifo: DMA_PUSHER - ch 1 [X[2962]] get 00018000 put 000180a0 state 80000000 (err: INVALID_CMD) push 00000010
[  471.988366] nouveau 0000:09:00.0: DRM: Setting dpms mode 0 on TV encoder (output 3)
[  471.988448] nouveau 0000:09:00.0: DRM: Output TV-1 is running on CRTC 0 using output B
[  472.033852] nouveau 0000:09:00.0: fifo: DMA_PUSHER - ch 1 [X[2962]] get 000180a0 put 000180b0 state 80000000 (err: INVALID_CMD) push 00000000
[  472.407262] nouveau 0000:09:00.0: fifo: DMA_PUSHER - ch 1 [X[2962]] get 000180b0 put 000180c0 state 80000000 (err: INVALID_CMD) push 00000000
[  483.836587] nouveau 0000:09:00.0: fifo: DMA_PUSHER - ch 1 [X[2962]] get 000180c0 put 000180d0 state 80000000 (err: INVALID_CMD) push 00000000
[  483.838904] nouveau 0000:09:00.0: fifo: DMA_PUSHER - ch 1 [X[2962]] get 000180d0 put 000180f0 state 80000000 (err: INVALID_CMD) push 00000000

and so on. Will investigate, I guess.
Comment 36 Ilia Mirkin 2017-03-18 20:26:29 UTC
Looks like the NV4A MMU doesn't like PCIE mode. I sent a patch which fixes it for me: https://patchwork.freedesktop.org/patch/144818/

Note that this has nothing to do with the originally-filed issue for the NV34 (sorry), only for Linas's NV44A issue (and mine).
Comment 37 Lukas Wunner 2018-02-26 11:53:35 UTC
(In reply to Linas Vepstas from comment #33)
> After this, starting X a second time does not generate any video-out, so
> monitors stay blanked. Nothing at all printed in /var/log/syslog. However
> kern.log shows:
> 
> Oct 31 15:41:47 fanny kernel: [ 2761.045902] INFO: task kworker/10:1:135
> blocked for more than 120 seconds.
> Oct 31 15:41:47 fanny kernel: [ 2761.045912]       Not tainted
> 3.18.0-rc2-linas #5
> Oct 31 15:41:47 fanny kernel: [ 2761.045915] "echo 0 >
> /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> Oct 31 15:41:47 fanny kernel: [ 2761.045920] kworker/10:1    D
> ffff88085dd53040     0   135      2 0x00000000
> Oct 31 15:41:47 fanny kernel: [ 2761.045959] Workqueue: events
> output_poll_execute [drm_kms_helper]

This hung task issue has been fixed in v4.16-rc3 with https://git.kernel.org/linus/d61a5c106351

It is unrelated however to the "failed to idle channel" issue AFAICS.
Comment 38 GitLab Migration User 2019-09-18 20:38:49 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1056.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.