Since upgrading from a 3.6.10 kernel to a 3.7.9 kernel (Fedora 18), the system hangs while loading gdm. Going back to the 3.6 kernel works; going to a 3.8 kernel hangs the same way.
The hang first shows up as the gdm spinning "loading" cursor stopping, followed a few seconds later by a screen full of a repeated noise pattern (different on each boot, but it looks like a small block of noise tiled all over the screen). The machine no longer responds to the keyboard (the Num Lock/Caps Lock LEDs do not toggle) or to the network (by this point NetworkManager has already brought up the interface, so I can get a few pings through before it hangs).
In particular, it is not the initial modeset at module load that hangs; only gdm does. If I boot with systemd.unit=multi-user.target, the system comes up without any hang; I can then load netconsole and isolate graphical.target to start gdm and watch it hang. I used that to capture the kernel messages up to the hang for the Fedora 3.7.9 kernel.
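For anyone trying to reproduce the capture, the procedure described above can be sketched roughly as follows. This is only an illustration: the interface name, receiver address, and port are made-up placeholders, and the helper function names are hypothetical. The netconsole= string follows the format given in the kernel's netconsole documentation, with the optional source port, source IP, and target MAC left empty. The script prints the commands instead of running them, so they can be reviewed first.

```shell
#!/bin/sh
# Build the netconsole= module parameter.
# Format: [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# Optional fields are left empty here.
netconsole_param() {
    # $1 = local interface, $2 = receiver IP, $3 = receiver UDP port
    printf 'netconsole=@/%s,%s@%s/\n' "$1" "$3" "$2"
}

# Print the two steps, to be run as root from a multi-user.target boot.
capture_steps() {
    printf 'modprobe netconsole %s\n' "$(netconsole_param "$1" "$2" "$3")"
    printf 'systemctl isolate graphical.target\n'
}

# Placeholder values; substitute the real interface and log receiver.
capture_steps eth0 192.0.2.10 6666
```

The receiving machine needs something listening on the chosen UDP port before the module is loaded, since everything after the hang is lost on the local disk.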
I also managed to capture the Xorg.0.log for the attempt with the 3.8 kernel (booting directly into graphical.target).
The relevant version numbers are:
Upstream 3.8 kernel with a Fedora config (hangs)
I also attempted once booting 3.7.9 with nouveau.config=NvPCIE=0 as suggested at bug #58776; it did not make any visible difference.
Created attachment 75390 [details]
First part of attempt with 3.7.9
First part of attempt with the 3.7.9 kernel, captured with dmesg after booting into multi-user.target
Created attachment 75391 [details]
Second part of attempt with 3.7.9
Second part of attempt with the 3.7.9 kernel, captured with netconsole before and during systemd isolate of graphical.target
Created attachment 75392 [details]
lspci -v (with 3.6 kernel)
Created attachment 75393 [details]
Xorg.0.log.old from attempt at kernel 3.8, copied after boot to working kernel
This is from when I attempted booting into the 3.8 kernel. Notice the block of NUL bytes at the end of the file, suggesting it might have been truncated.
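A quick way to confirm that a log contains NUL bytes is to compare its size with and without them. This is a minimal sketch; the function name is hypothetical, and it counts every NUL byte in the file, not only those in a trailing block.

```shell
#!/bin/sh
# Count the NUL bytes in a file by comparing its total size with its
# size after all NUL bytes have been stripped out.
count_nul_bytes() {
    total=$(wc -c < "$1")
    kept=$(tr -d '\000' < "$1" | wc -c)
    echo $((total - kept))
}

# Example (file name as in the attachment above):
#   count_nul_bytes Xorg.0.log.old
```

A non-zero count in a text log usually means the file was still being written when the machine went down, so its last messages may be missing.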
If there is anything else I could add to the kernel command line to get better logs, just ask and I will try again. I could not easily find on the wiki an authoritative list of debugging parameters, so I might not have made it print enough debugging noise.
I just did a bisect.
The symptoms change partway through the bisect (losing video at the initial modeset, instead of hanging completely at gdm), so I am not completely sure of the result, but here it is:
70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9 is the first bad commit
Author: Ben Skeggs <email@example.com>
Date: Fri Jul 13 16:49:49 2012 +1000
drm/nv04-nv40/fifo: remove use of nouveau_gpuobj_new_fake()
Signed-off-by: Ben Skeggs <firstname.lastname@example.org>
:040000 040000 29a991b723d037cfe7fb7a5dd3a34b8321e489d1 cb531c96db341f2340f62511cb7dc1c2b84cefc5 M drivers
And the bisect log:
# bad: [29594404d7fe73cd80eaa4ee8c43dcc53970c60e] Linux 3.7
# good: [a0d271cbfed1dd50278c6b06bead3d00ba0a88f9] Linux 3.6
git bisect start 'v3.7' 'v3.6' '--' 'drivers/gpu/drm/nouveau/'
# bad: [cd8c14b407d59ac4b8d324f5f9cdf223a2079c88] drm/nvc0/ltcg: read LTS count at startup
git bisect bad cd8c14b407d59ac4b8d324f5f9cdf223a2079c88
# bad: [c4afbe74cebf887d3d8e7a11aa93bebcb6a3e2e1] drm/nvc0-/gr: share headers between fermi and kepler graphics code
git bisect bad c4afbe74cebf887d3d8e7a11aa93bebcb6a3e2e1
# good: [0134a97979a0abc1c756b0fe491e074693c2bdf5] drm/nv50-/instmem: allocate vram for kernel objects from end of vram
git bisect good 0134a97979a0abc1c756b0fe491e074693c2bdf5
# bad: [8a9b889e668a5bc2f4031015fe4893005c43403d] drm/nouveau: remove last use of nouveau_gpuobj_new_fake()
git bisect bad 8a9b889e668a5bc2f4031015fe4893005c43403d
# bad: [70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9] drm/nv04-nv40/fifo: remove use of nouveau_gpuobj_new_fake()
git bisect bad 70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9
# good: [af7afbd2e1409168698bde2f2846848b07d05d12] drm/nv04-nv40/instmem: duplicate nv04 code as nv40, remove alternate paths
git bisect good af7afbd2e1409168698bde2f2846848b07d05d12
# good: [5787640db6ae722aeadb394d480c7ca21b603e34] drm/nv04-nv40/instmem: remove use of nouveau_gpuobj_new_fake()
git bisect good 5787640db6ae722aeadb394d480c7ca21b603e34
And my notes on the bisect kernels:
cd8c14b407d59ac4b8d324f5f9cdf223a2079c88 hangs at gdm
c4afbe74cebf887d3d8e7a11aa93bebcb6a3e2e1 hangs at initial modeset
8a9b889e668a5bc2f4031015fe4893005c43403d initial modeset: blank screen followed by dpms off, keyboard still works
70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9 again blank screen/dpms off at initial modeset
Searching for that commit id, I found https://bugzilla.kernel.org/show_bug.cgi?id=50091 which has very similar symptoms (hang on X server startup, garbled screen) and is also NV40 family.
I finally made it work on both 3.8 and 3.7.9: adding nouveau.noaccel=1 to the kernel command line avoids the hang. Of course, that does not fix the bug; it only works around it.
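When testing workarounds like this it is easy to boot the wrong menu entry, so it helps to check that the parameter really reached the kernel. The following is a small sketch with a hypothetical helper name; it only does a whole-word match against the command-line string and reads /proc/cmdline if it exists.

```shell
#!/bin/sh
# Succeed if parameter $1 appears as a whole word in command line $2.
has_kernel_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Check the running kernel's command line.
cmdline=$(cat /proc/cmdline 2>/dev/null)
if has_kernel_param nouveau.noaccel=1 "$cmdline"; then
    echo "nouveau.noaccel=1 is set"
else
    echo "nouveau.noaccel=1 is not set"
fi
```

The same check works for any other nouveau option passed on the command line.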
Created attachment 75421 [details]
dmesg for 3.8 with nouveau.noaccel=1
Created attachment 75422 [details]
Xorg.0.log for 3.8 with nouveau.noaccel=1
Created attachment 79949 [details] [review]
initialize ramfc to zero
Just guessing, based on the commit itself. Does the patch help?
(In reply to comment #11)
> Created attachment 79949 [details] [review]
> initialize ramfc to zero
> Just guessing, based on the commit itself. Does the patch help?
I applied it on top of 3.9.4; it did not help.
Similar downstream report at https://bugs.gentoo.org/show_bug.cgi?id=472200
I think I filed the same bug report; you can find it here: https://bugs.freedesktop.org/show_bug.cgi?id=87361
I posted it in the Mesa/DRI/nouveau section, but I am not sure it belongs there.
Anyway, someone on IRC suggested that I use the kernel boot parameter nouveau.config=NvMSI=0
This seems to fix the whole problem in my case and gives me a perfectly working system.
The person on IRC told me that NvMSI may need to be blacklisted by default for the NV4C chipsets so that it works out of the box.
Hopefully this helps.
Greetings, Jan Jasper de Kroon