Bug 61321

Summary: [regression][NV4c] System hang while loading gdm on 3.7 kernel (works on 3.6)
Product: xorg Reporter: Cesar Eduardo Barros <cesarb>
Component: Driver/nouveau Assignee: Nouveau Project <nouveau>
Status: RESOLVED MOVED QA Contact: Xorg Project Team <xorg-team>
Severity: major    
Priority: medium CC: comer352l, jwilk, TomWij
Version: unspecified Keywords: regression
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
See Also: https://bugzilla.kernel.org/show_bug.cgi?id=50091
https://bugs.freedesktop.org/show_bug.cgi?id=87361
Attachments (description, flags):
First part of attempt with 3.7.9 (flags: none)
Second part of attempt with 3.7.9 (flags: none)
lspci -v (flags: none)
Xorg.0.log.old from attempt at kernel 3.8, copied after boot to working kernel (flags: none)
dmesg for 3.8 with nouveau.noaccel=1 (flags: none)
Xorg.0.log for 3.8 with nouveau.noaccel=1 (flags: none)
initialize ramfc to zero (flags: none)

Description Cesar Eduardo Barros 2013-02-23 00:18:49 UTC
Since upgrading from a 3.6.10 kernel to a 3.7.9 kernel (Fedora 18), the system hangs while loading gdm. Going back to the 3.6 kernel works; a 3.8 kernel hangs the same way.

The hang manifests first as the gdm spinning "loading" cursor stopping, followed a few seconds later by a screen full of a repeated noise pattern (different on each boot, but it looks like a small block of noise tiled all over the screen). The machine does not respond to the keyboard (num lock/caps lock LEDs) or the network (by this point NetworkManager has already brought up the interface, so I can get a few pings before it hangs).

In particular, it is not the initial modeset when loading the module which hangs; it is only gdm. If I boot with systemd.unit=multi-user.target, I can boot without any hangs, load netconsole, and isolate graphical.target to load gdm to see it hang. I used that to capture the kernel messages until the hang for the Fedora 3.7.9 kernel.
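
The capture procedure described above can be sketched roughly as follows. Only systemd.unit=multi-user.target, netconsole, and isolating graphical.target come from this report; the interface name, receiver address, and port are placeholder assumptions, and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch only: eth0, 192.0.2.10 and port 6666 are placeholder assumptions.

# Build the netconsole module parameter, which has the general form
#   netconsole=[src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
netconsole_param() {
    dev=$1                  # local network interface
    target_ip=$2            # machine that receives the kernel messages
    target_port=${3:-6666}  # UDP port on the receiver
    printf 'netconsole=@/%s,%s@%s/\n' "$dev" "$target_port" "$target_ip"
}

# Run as root after booting with systemd.unit=multi-user.target.
# Defined but not called here, because the last step is expected to hang.
reproduce_hang() {
    modprobe netconsole "$(netconsole_param eth0 192.0.2.10)" || return 1
    systemctl isolate graphical.target
}
```

The receiving machine needs a UDP listener on the chosen port before reproduce_hang is run; which tool to use for that is left open here.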

I also managed to capture the Xorg.0.log for the attempt with the 3.8 kernel (booting directly into graphical.target).

The relevant version numbers are:

kernel-3.6.10-4.fc18.x86_64 (works)
kernel-3.7.9-201.fc18.x86_64 (hangs)
Upstream 3.8 kernel with a Fedora config (hangs)
xorg-x11-server-Xorg-1.13.2-2.fc18.x86_64
xorg-x11-drv-nouveau-1.0.6-1.fc18.x86_64
libdrm-2.4.42-1.fc18.x86_64
mesa-dri-drivers-9.0.1-4.fc18.x86_64

I also tried booting 3.7.9 once with nouveau.config=NvPCIE=0, as suggested in bug #58776; it made no visible difference.
Comment 1 Cesar Eduardo Barros 2013-02-23 00:20:09 UTC
Created attachment 75390 [details]
First part of attempt with 3.7.9

First part of attempt with the 3.7.9 kernel, captured with dmesg after booting into multi-user.target
Comment 2 Cesar Eduardo Barros 2013-02-23 00:21:38 UTC
Created attachment 75391 [details]
Second part of attempt with 3.7.9

Second part of attempt with the 3.7.9 kernel, captured with netconsole before and during systemd isolate of graphical.target
Comment 3 Cesar Eduardo Barros 2013-02-23 00:22:55 UTC
Created attachment 75392 [details]
lspci -v

lspci -v (with 3.6 kernel)
Comment 4 Cesar Eduardo Barros 2013-02-23 00:25:24 UTC
Created attachment 75393 [details]
Xorg.0.log.old from attempt at kernel 3.8, copied after boot to working kernel

This is from when I attempted to boot into the 3.8 kernel. Notice the block of NUL bytes at the end of the file, suggesting it might have been truncated.
Comment 5 Cesar Eduardo Barros 2013-02-23 00:28:24 UTC
If there is anything else I could add to the kernel command line to get better logs, just ask and I will try again. I could not easily find on the wiki an authoritative list of debugging parameters, so I might not have made it print enough debugging noise.
Comment 6 Cesar Eduardo Barros 2013-02-23 18:27:36 UTC
I just did a bisect.

The symptoms change mid-bisect (losing video at the initial modeset, instead of hanging completely at gdm), so I am not completely sure, but here are the results:

70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9 is the first bad commit
commit 70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9
Author: Ben Skeggs <bskeggs@redhat.com>
Date:   Fri Jul 13 16:49:49 2012 +1000

    drm/nv04-nv40/fifo: remove use of nouveau_gpuobj_new_fake()
    
    Signed-off-by: Ben Skeggs <bskeggs@redhat.com>

:040000 040000 29a991b723d037cfe7fb7a5dd3a34b8321e489d1 cb531c96db341f2340f62511cb7dc1c2b84cefc5 M	drivers

And the bisect log:

# bad: [29594404d7fe73cd80eaa4ee8c43dcc53970c60e] Linux 3.7
# good: [a0d271cbfed1dd50278c6b06bead3d00ba0a88f9] Linux 3.6
git bisect start 'v3.7' 'v3.6' '--' 'drivers/gpu/drm/nouveau/'
# bad: [cd8c14b407d59ac4b8d324f5f9cdf223a2079c88] drm/nvc0/ltcg: read LTS count at startup
git bisect bad cd8c14b407d59ac4b8d324f5f9cdf223a2079c88
# bad: [c4afbe74cebf887d3d8e7a11aa93bebcb6a3e2e1] drm/nvc0-/gr: share headers between fermi and kepler graphics code
git bisect bad c4afbe74cebf887d3d8e7a11aa93bebcb6a3e2e1
# good: [0134a97979a0abc1c756b0fe491e074693c2bdf5] drm/nv50-/instmem: allocate vram for kernel objects from end of vram
git bisect good 0134a97979a0abc1c756b0fe491e074693c2bdf5
# bad: [8a9b889e668a5bc2f4031015fe4893005c43403d] drm/nouveau: remove last use of nouveau_gpuobj_new_fake()
git bisect bad 8a9b889e668a5bc2f4031015fe4893005c43403d
# bad: [70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9] drm/nv04-nv40/fifo: remove use of nouveau_gpuobj_new_fake()
git bisect bad 70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9
# good: [af7afbd2e1409168698bde2f2846848b07d05d12] drm/nv04-nv40/instmem: duplicate nv04 code as nv40, remove alternate paths
git bisect good af7afbd2e1409168698bde2f2846848b07d05d12
# good: [5787640db6ae722aeadb394d480c7ca21b603e34] drm/nv04-nv40/instmem: remove use of nouveau_gpuobj_new_fake()
git bisect good 5787640db6ae722aeadb394d480c7ca21b603e34

And my notes on the bisect kernels:

cd8c14b407d59ac4b8d324f5f9cdf223a2079c88 hangs at gdm
c4afbe74cebf887d3d8e7a11aa93bebcb6a3e2e1 hangs at initial modeset
0134a97979a0abc1c756b0fe491e074693c2bdf5 works
8a9b889e668a5bc2f4031015fe4893005c43403d initial modeset: blank screen followed by dpms off, keyboard still works
70ee6f1cd6911098ddd4c11ee21b69dbe51fb3f9 again blank screen/dpms off at initial modeset
af7afbd2e1409168698bde2f2846848b07d05d12 works
5787640db6ae722aeadb394d480c7ca21b603e34 works
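
To line up notes like these with the bisect log, the verdicts can be pulled out of a saved log. This is a minimal sketch, assuming the log was saved to a file; the file name bisect.log and the helper name are hypothetical.

```shell
#!/bin/sh
# Print "<verdict> <sha>" for each commit marked in a saved `git bisect log`.
# Comment lines (starting with '#') and the `git bisect start` line are skipped.
bisect_verdicts() {
    awk '$1 == "git" && $2 == "bisect" && ($3 == "good" || $3 == "bad") {
        print $3, $4
    }' "$1"
}

# Example (bisect.log is a hypothetical file name):
#   git bisect log > bisect.log
#   bisect_verdicts bisect.log
#   git bisect replay bisect.log   # re-create the same bisect state in a kernel tree
```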
Comment 7 Cesar Eduardo Barros 2013-02-23 18:35:27 UTC
Searching for that commit id, I found https://bugzilla.kernel.org/show_bug.cgi?id=50091, which has very similar symptoms (hang on X server startup, garbled screen) and also involves an NV40-family card.
Comment 8 Cesar Eduardo Barros 2013-02-23 19:34:56 UTC
Finally made it work on both 3.8 and 3.7.9: just add nouveau.noaccel=1 and there is no hang. Of course, that does not fix the bug, just avoids it.
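
A rough sketch of applying and checking this workaround. The grubby invocation assumes a Fedora-style setup and is shown only as a comment; the helper (a hypothetical name) just checks whether a parameter is present on the running kernel's command line.

```shell
#!/bin/sh
# Make the parameter persistent (assumption: Fedora with grubby, run as root):
#   grubby --update-kernel=ALL --args="nouveau.noaccel=1"

# Return 0 if the exact parameter appears in the kernel command line.
# $1 = parameter, $2 = file to read (defaults to /proc/cmdline).
has_kernel_param() {
    tr ' ' '\n' < "${2:-/proc/cmdline}" | grep -qxF -- "$1"
}

# Example:
#   has_kernel_param nouveau.noaccel=1 && echo "acceleration disabled"
```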
Comment 9 Cesar Eduardo Barros 2013-02-23 19:37:07 UTC
Created attachment 75421 [details]
dmesg for 3.8 with nouveau.noaccel=1
Comment 10 Cesar Eduardo Barros 2013-02-23 19:37:31 UTC
Created attachment 75422 [details]
Xorg.0.log for 3.8 with nouveau.noaccel=1
Comment 11 Maarten Lankhorst 2013-05-29 10:54:37 UTC
Created attachment 79949 [details] [review]
initialize ramfc to zero

Just guessing, based on the commit itself. Does the patch help?
Comment 12 Cesar Eduardo Barros 2013-06-03 11:18:30 UTC
(In reply to comment #11)
> Created attachment 79949 [details] [review]
> initialize ramfc to zero
> 
> Just guessing, based on the commit itself. Does the patch help?

Applied on top of 3.9.4, did not help.
Comment 13 Tom Wijsman 2013-08-12 21:57:54 UTC
Similar downstream report at https://bugs.gentoo.org/show_bug.cgi?id=472200
Comment 14 Jan Jasper de Kroon 2014-12-16 16:27:29 UTC
I think I filed the same bug report; you can find it here: https://bugs.freedesktop.org/show_bug.cgi?id=87361
I posted it in the Mesa/Dri/Nouveau section, but I am not sure it belongs there.
Anyway, someone on IRC suggested that I use the kernel boot parameter nouveau.config=NvMSI=0
This seems to fix the whole problem in my case and gives me a perfectly working system.
The person on IRC told me that NvMSI may need to be blacklisted by default for the NV4C chipsets so that they work out of the box.
Hopefully this helps.

Greetings Jan Jasper de Kroon
Comment 15 Martin Peres 2019-12-04 08:32:38 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to and participate in the new issue on our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/35.
