Bug 111642 - NV43 GeForce 6600 Nouveau is not stable on legacy hardware
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: x86-64 (AMD64) All
Assignee: Nouveau Project
QA Contact: Xorg Project Team
Reported: 2019-09-11 06:43 UTC by Vasili Pupkin
Modified: 2019-09-16 12:39 UTC

Description Vasili Pupkin 2019-09-11 06:43:56 UTC
First of all, I appreciate this community's work. A few years ago the driver did not work at all on the NV43 GPU; now it works, but it is not yet stable.

I am on Ubuntu 18.04.3 LTS, tested on kernels 4.15/4.18/5.0, with xserver-xorg-video-nouveau 1:1.0.15-2 and libdrm-nouveau2 2.4.97-1ubuntu1~18.04.1.
I have also tested it with xserver-xorg-video-nouveau-hwe-18.04 1:1.0.16-1~18.04.1.
dmesg is full of these lines:
[  199.658774] nouveau 0000:04:00.0: systemd-logind[1352]: validate: -22
[  199.658902] nouveau 0000:04:00.0: systemd-logind[1352]: fail set_domain
[  199.658905] nouveau 0000:04:00.0: systemd-logind[1352]: validating bo list
[  199.658907] nouveau 0000:04:00.0: systemd-logind[1352]: validate: -22
[  200.075155] nouveau 0000:04:00.0: systemd-logind[1352]: fail set_domain
[  200.075158] nouveau 0000:04:00.0: systemd-logind[1352]: validating bo list
[  200.075160] nouveau 0000:04:00.0: systemd-logind[1352]: validate: -22
[  200.075215] nouveau 0000:04:00.0: systemd-logind[1352]: fail set_domain
syslog is full of these lines:
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x000482fc
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x00000003
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x00104300
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x0000000a
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x14001400
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x0025a000
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x01240000
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x00000000
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: message repeated 3 times: [ nouveau: #0110x00000000]
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x000c8300
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x00000000
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x00000000
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x04000500
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x00020000
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: #0110x00000000
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: kernel rejected pushbuf: Invalid argument
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: ch2: krec 0 pushes 1 bufs 3 relocs 4
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: ch2: buf 00000000 00000005 00000004 00000004 00000000
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: ch2: buf 00000001 0000000b 00000002 00000002 00000000
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: ch2: buf 00000002 0000000c 00000002 00000000 00000002
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: ch2: rel 00000000 00002d30 00000001 00000000 00044308 00000000 00000000
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: ch2: rel 00000000 00002d34 00000001 00000001 00000000 00000000 00000000
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: ch2: rel 00000000 00002d38 00000002 00000000 0004430c 00000000 00000000
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: ch2: rel 00000000 00002d3c 00000002 00000001 00000000 00000000 00000000
Sep 11 01:50:33 /usr/lib/gdm3/gdm-x-session[7653]: nouveau: ch2: psh 00000000 0000002d30 0000002d48
... and there are 5 GB of these messages in syslog

The system works, but sometimes it freezes completely. It may work for hours and then freeze, or it may freeze at login.
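With gigabytes of repeated lines, a per-device tally is easier to read than the raw log. The following is a minimal sketch, not part of the original report: it assumes the kernel line layout shown in the dmesg excerpt above (`nouveau <PCI address>: <process>[<pid>]: <message>`), and the names `LINE_RE` and `summarize` are hypothetical helpers introduced only for illustration.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpt in this report:
# [  199.658774] nouveau 0000:04:00.0: systemd-logind[1352]: validate: -22
# The "process[pid]: " part is optional, as in "nouveau 0000:01:00.0: timeout".
LINE_RE = re.compile(
    r"nouveau (?P<pci>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): "
    r"(?:(?P<proc>[^\s:\[]+)\[\d+\]: )?(?P<msg>.*)$"
)


def summarize(lines):
    """Count nouveau kernel messages per (PCI address, process, message)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            key = (
                match.group("pci"),
                match.group("proc") or "-",  # "-" when no process is named
                match.group("msg").strip(),
            )
            counts[key] += 1
    return counts
```

Passing an open log file (or the captured output of `dmesg`) to `summarize` and printing `most_common()` shows at a glance which PCI device and which process the messages are attributed to.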
Comment 1 Ilia Mirkin 2019-09-11 11:50:38 UTC
Those messages imply you've run out of vram. Are you using something like gnome or kde? Those won't work well on this hardware.
Comment 2 Vasili Pupkin 2019-09-11 12:00:41 UTC
Ubuntu 16.04 works just fine with nvidia-304 drivers. If the lack of ram is the problem it would be helpful to have better diagnostic messages.
Comment 3 Ilia Mirkin 2019-09-11 12:06:26 UTC
(In reply to Vasili Pupkin from comment #2)
> Ubuntu 16.04 works just fine with nvidia-304 drivers. If the lack of ram is
> the problem it would be helpful to have better diagnostic messages.

We have them.

[  199.658774] nouveau 0000:04:00.0: systemd-logind[1352]: validate: -22
[  199.658902] nouveau 0000:04:00.0: systemd-logind[1352]: fail set_domain
[  199.658905] nouveau 0000:04:00.0: systemd-logind[1352]: validating bo list

This indicates a lack of ability to place all the buffers needed into vram/gart as requested by the submitter.

The NVIDIA drivers will work much better for you if you're trying to make heavy use of GL, like modern systems like to do.
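For readers unfamiliar with kernel return codes: the Linux kernel reports failures as negative errno values, so `validate: -22` is `-EINVAL`, which matches the "kernel rejected pushbuf: Invalid argument" line in the syslog excerpt. The sketch below only names the code; it says nothing about the root cause. `decode_validate_code` is a hypothetical helper, not part of nouveau or libdrm.

```python
import errno
import os
import re


def decode_validate_code(line):
    """Return (errno name, description) for a 'validate: -N' line, else None."""
    match = re.search(r"validate: -(\d+)", line)
    if match is None:
        return None
    code = int(match.group(1))
    # errno.errorcode maps numeric codes to symbolic names on this platform.
    return errno.errorcode.get(code, "unknown"), os.strerror(code)
```

Applied to the first quoted dmesg line, the helper yields the symbolic name `EINVAL` together with the platform's description of that error.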
Comment 4 Vasili Pupkin 2019-09-11 12:16:38 UTC
These diagnostics only make sense to nouveau developers.
The nvidia-304 drivers have been dropped, and nouveau seems to be the only out-of-the-box option for Ubuntu 18.04. But why don't the NVIDIA drivers experience this lack-of-VRAM problem?
Comment 5 Ilia Mirkin 2019-09-11 12:18:04 UTC
(In reply to Vasili Pupkin from comment #4)
> These diagnostics only make sense to nouveau developers.
> The nvidia-304 drivers have been dropped, and nouveau seems to be the only
> out-of-the-box option for Ubuntu 18.04. But why don't the NVIDIA drivers
> experience this lack-of-VRAM problem?

Because they've had man-decades invested into their development to ensure that they handle these types of situations well. Nouveau GL drivers have not.
Comment 6 Vasili Pupkin 2019-09-11 12:23:41 UTC
The lack of human resources is sad, but this is still a bug. Mark it as WONTFIX if supporting this legacy hardware is outside the project's goals.

I am happy to test a patch, if it is not architecturally impossible for nouveau to fix this by backing VRAM with main memory.
Comment 7 Ilia Mirkin 2019-09-11 12:29:33 UTC
In the meantime, I suspect that if you add LIBGL_ALWAYS_SOFTWARE=1 to your /etc/environment, you will be much happier.

You can then still enable GL for certain programs that you actually want to use it for, but not for random GTK/Qt programs that want to draw a button and think it's a great idea to start using GL for that.
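One way to re-enable hardware GL for a single program while LIBGL_ALWAYS_SOFTWARE=1 is set globally is to launch that program with the variable removed from its environment. This is a minimal sketch under the assumption that Mesa does not force software rendering when the variable is unset; `hardware_gl_env` and `run_with_hardware_gl` are hypothetical names, not an existing tool.

```python
import os
import subprocess


def hardware_gl_env(base_env=None):
    """Copy an environment and drop LIBGL_ALWAYS_SOFTWARE from the copy."""
    env = dict(os.environ if base_env is None else base_env)
    # Removing the variable (rather than setting it to "0") avoids relying
    # on how a particular Mesa version parses the value.
    env.pop("LIBGL_ALWAYS_SOFTWARE", None)
    return env


def run_with_hardware_gl(argv):
    """Run one program without the global software-rendering override."""
    return subprocess.run(argv, env=hardware_gl_env(), check=False)
```

For example, `run_with_hardware_gl(["glxinfo"])` would start `glxinfo` without the override, while every other program keeps the setting from /etc/environment.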
Comment 8 Vasili Pupkin 2019-09-11 13:23:10 UTC
LIBGL_ALWAYS_SOFTWARE=1 didn't help at all, same messages in dmesg and syslog
Comment 9 Ilia Mirkin 2019-09-11 13:25:20 UTC
(In reply to Vasili Pupkin from comment #8)
> LIBGL_ALWAYS_SOFTWARE=1 didn't help at all, same messages in dmesg and syslog

Must not have gotten picked up =/

Just remove nouveau_dri.so from ... /usr/lib/dri/ or something along those lines.
Comment 10 Vasili Pupkin 2019-09-11 16:50:02 UTC
Removing nouveau_dri.so didn't help either
Comment 11 Ilia Mirkin 2019-09-11 17:44:55 UTC
(In reply to Vasili Pupkin from comment #10)
> Removing nouveau_dri.so didn't help either

Erm ... that implies that the error is not from what I think it is.

Just to super-triple-check ... run "glxinfo" - does that say you're using LLVMpipe or nouveau?
Comment 12 Vasili Pupkin 2019-09-11 18:31:41 UTC
It shows

OpenGL renderer string: llvmpipe (LLVM 8.0, 256 bits)
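The same check can be scripted when it has to be repeated after each configuration change. This sketch parses text in the format `glxinfo` prints above; the list of software renderer names is an assumption made for illustration, and `renderer_string` and `is_software_renderer` are hypothetical helpers.

```python
import re

# Assumed substrings that identify Mesa software rasterizers; only
# "llvmpipe" is confirmed by the output quoted in this report.
SOFTWARE_RENDERERS = ("llvmpipe", "softpipe", "software rasterizer")


def renderer_string(glxinfo_output):
    """Extract the 'OpenGL renderer string' value from glxinfo output."""
    match = re.search(
        r"^OpenGL renderer string:\s*(.+)$", glxinfo_output, re.MULTILINE
    )
    return match.group(1).strip() if match else None


def is_software_renderer(renderer):
    """Return True if the renderer name looks like a software rasterizer."""
    if renderer is None:
        return False
    lowered = renderer.lower()
    return any(name in lowered for name in SOFTWARE_RENDERERS)
```

The text passed in could come from `subprocess.run(["glxinfo"], capture_output=True, text=True).stdout`, run inside the desktop session being tested.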
Comment 13 Ilia Mirkin 2019-09-12 05:42:22 UTC
(In reply to Vasili Pupkin from comment #12)
> It shows
> 
> OpenGL renderer string: llvmpipe (LLVM 8.0, 256 bits)

OK. So you've successfully killed off nouveau GL impl, but you're still seeing those errors? That's very very very surprising.

Can you tell me more about your environment? (Have you tried restarting your desktop / window manager? Otherwise this is all moot.)
Comment 14 Vasili Pupkin 2019-09-12 20:10:16 UTC
The system has two identical cards installed (I had completely forgotten about this because no monitor is connected to the second adapter). I connected a monitor to the second card and it shows only the cursor; no window appears when I move one to the second monitor. So I removed the second card, and nouveau stopped flooding dmesg with those messages. The freezes remain, and the system feels less stable with one card than it did with two: it rarely survives more than five minutes.

So there are two questions:
Are those messages a bug, or does the two-adapter setup require some additional configuration to work properly?
How can I debug the freezes? Ctrl+Alt+F2 doesn't work.

I found this stack trace at the end of syslog; I am not sure whether it is the last trace before the crash.

------------[ cut here ]------------
nouveau 0000:01:00.0: timeout
....
....
 Call Trace:
  nvkm_vmm_iter.constprop.12+0x2e5/0x880 [nouveau]
  ? nv41_vmm_pgt_sgl+0x140/0x140 [nouveau]
  ? nvkm_vmm_free_insert+0x80/0x80 [nouveau]
  ? nvkm_vmm_put_region+0xd0/0x160 [nouveau]
  nvkm_vmm_ptes_unmap_put+0x32/0x50 [nouveau]
  ? nv41_vmm_pgt_sgl+0x140/0x140 [nouveau]
  nvkm_vmm_put_locked+0x103/0x220 [nouveau]
  nvkm_uvmm_mthd+0x7eb/0x850 [nouveau]
  nvkm_object_mthd+0x1a/0x30 [nouveau]
  nvkm_ioctl_mthd+0x5d/0xb0 [nouveau]
  nvkm_ioctl+0x11d/0x280 [nouveau]
  nvkm_client_ioctl+0x12/0x20 [nouveau]
  nvif_object_ioctl+0x47/0x50 [nouveau]
  nvif_object_mthd+0x129/0x150 [nouveau]
  ? __ttm_dma_free_page.isra.5+0x32/0x40 [ttm]
  ? isolate_huge_page+0x30/0xa0
  ? __ttm_dma_free_page.isra.5+0x32/0x40 [ttm]
  ? ttm_dma_page_put+0x53/0x90 [ttm]
  nvif_vmm_put+0x5f/0x80 [nouveau]
  nouveau_mem_fini+0x3b/0x70 [nouveau]
  nv04_sgdma_unbind+0x12/0x20 [nouveau]
  ttm_tt_unbind+0x21/0x40 [ttm]
  ttm_tt_destroy.part.12+0x12/0x60 [ttm]
  ttm_tt_destroy+0x13/0x20 [ttm]
  ttm_bo_cleanup_memtype_use+0x32/0x70 [ttm]
  ttm_bo_cleanup_refs+0x1c0/0x200 [ttm]
  ? ttm_mem_global_free+0x13/0x20 [ttm]
  ttm_bo_delayed_delete+0x1cd/0x1e0 [ttm]
  ttm_bo_delayed_workqueue+0x1b/0x40 [ttm]
  process_one_work+0x1fd/0x400
  worker_thread+0x34/0x410
  kthread+0x121/0x140
  ? process_one_work+0x400/0x400
  ? kthread_park+0x90/0x90
  ret_from_fork+0x35/0x40
Comment 15 Ilia Mirkin 2019-09-12 22:57:50 UTC
Your desktop environment could be trying to add the second GPU's outputs and/or trying to use it for render offload. Neither will work well with nv4x generation GPUs. I can believe that the set_domain stuff is failing because of that. I hadn't considered that option.

The timeout is bad - that means something hung. Probably the messages before that would be more interesting than after. The first error tends to be the most useful one.

What desktop environment are you using, if any?
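Since the first error and the messages just before it are the ones requested, a small filter can pull that context out of a large log. This is a sketch only: it assumes the kernel warning begins with the `------------[ cut here ]------------` marker seen in comment 14, and `context_before_first_warning` is a hypothetical helper.

```python
from collections import deque

CUT_MARKER = "------------[ cut here ]------------"


def context_before_first_warning(lines, keep=20):
    """Return up to `keep` lines preceding the first kernel warning marker.

    Returns None if the marker never appears.
    """
    recent = deque(maxlen=keep)  # rolling window of the most recent lines
    for line in lines:
        if CUT_MARKER in line:
            return list(recent)
        recent.append(line.rstrip("\n"))
    return None
```

Because it stops at the first marker, the function reads only as much of the log as it needs, which matters for a multi-gigabyte syslog.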
Comment 16 Vasili Pupkin 2019-09-14 03:48:48 UTC
I am using GNOME 3.28.2 on Xorg 1.20.4. Wayland disappeared from the list on the GDM login screen after recent upgrades.
Comment 17 Vasili Pupkin 2019-09-15 20:29:24 UTC
It seems that this timeout error in nv41_vmm_flush may be a consequence of the bug rather than its cause. nouveau starts logging these timeouts after a freeze, and there can be quite a few of them before I restart the system.