89664 – Nouveau fails to enter KMS with the Gigabyte G1 Gaming GTX970

Status:	RESOLVED MOVED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/nouveau
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Nouveau Project
QA Contact:	Xorg Project Team

Reported:	2015-03-18 19:17 UTC by Omar
Modified:	2019-12-04 08:57 UTC
CC List:	2 users

Attachments
VBIOS file (135.00 KB, application/octet-stream) 2015-03-18 19:20 UTC, Omar
journalctl dump of the boot (1.65 MB, text/plain) 2015-03-18 19:29 UTC, Omar
Journal nouveau grep (9.27 KB, text/plain) 2015-10-22 21:42 UTC, Omar
Photo of the colored stripes bar (259.23 KB, image/jpeg) 2015-10-22 21:47 UTC, Omar
Journal nouveau grep, NvForcePost enabled (9.92 KB, text/plain) 2015-10-22 22:09 UTC, Omar
Kernel 4.8.12, nouveau defaults (26.30 KB, text/plain) 2016-12-09 20:27 UTC, Omar
Kernel 4.8.12, nouveau NvForcePost (448.07 KB, text/plain) 2016-12-09 20:28 UTC, Omar
Kernel module built from master, nouveau.debug=debug (60.00 KB, text/x-log) 2017-05-25 09:46 UTC, Mikołaj Świątek
nouveau master, trace (122.79 KB, text/x-log) 2017-05-26 15:19 UTC, Mikołaj Świątek

Description Omar 2015-03-18 19:17:10 UTC

Using the kernel from the linux-4.0 branch I attempted to use nouveau with my GTX970.
At the moment the nouveau driver started and the system should switch to KMS, the screen turned black instead.


The dmesg output was cut off due to too many messages from nouveau.
As such I attached the journalctl output of that boot session instead.

Comment 1 Omar 2015-03-18 19:20:49 UTC

Created attachment 114449 [details]
VBIOS file

VBIOS retrieved using:
echo 1 > /sys/bus/pci/devices/<pciid>/rom; cat /sys/bus/pci/devices/<pciid>/rom > vbios.rom; echo 0 > /sys/bus/pci/devices/<pciid>/rom
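
The one-liner above can be wrapped in a small helper. This is only a sketch: the function names `vbios_rom_path` and `dump_vbios` and the `SYSFS_ROOT` variable are made up for illustration, the PCI address has to be replaced with the card's own (for example as printed by `lspci -D`), and writing to the real `rom` file needs root.

```shell
# Hypothetical helpers (names are not from nouveau or the kernel).
# SYSFS_ROOT is an assumption added so the path can be overridden for testing.
SYSFS_ROOT="${SYSFS_ROOT:-/sys/bus/pci/devices}"

# Print the sysfs ROM path for a full PCI address such as 0000:01:00.0.
vbios_rom_path() {
    printf '%s/%s/rom\n' "$SYSFS_ROOT" "$1"
}

# Enable ROM reads, copy the image to $2, then disable reads again.
dump_vbios() {
    rom=$(vbios_rom_path "$1")
    [ -e "$rom" ] || { echo "no ROM file at $rom" >&2; return 1; }
    echo 1 > "$rom" || return 1      # enable reading (needs root on real sysfs)
    cat "$rom" > "$2"
    status=$?
    echo 0 > "$rom"                  # always disable reading again
    return "$status"
}
```

Usage would look like `dump_vbios 0000:01:00.0 vbios.rom`. Disabling the ROM again even when the copy fails mirrors the final `echo 0` of the original command.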


Original BIOS installation files can be found here:
http://www.gigabyte.com/products/product-page.aspx?pid=5209#bios

My card uses the F13 BIOS.

Comment 2 Omar 2015-03-18 19:29:57 UTC

Created attachment 114450 [details]
journalctl dump of the boot

The journal dump I mentioned in the OP.
Capped around line 15K as the original totalled at almost 50K, crossing the 3MB file size limit.

Comment 3 Ilia Mirkin 2015-03-18 19:51:36 UTC

As I mentioned on IRC, there are actually 2x GM204's. And *neither* says that it's running its vbios tables, which is very odd. Try booting with

nouveau.config=NvForcePost=1

Perhaps we're misdetecting the posting-ness state of things. The card that's bitching is 0000:01:00.0 but the :2 one is probed by nouveau first, not sure if that's significant.

If this doesn't help, I'd recommend retesting with just one of them plugged in.
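
Before comparing results, it may be worth confirming that the option really reached the kernel. A minimal sketch, assuming the option was added to the boot loader's kernel command line; `cmdline_has_param` is a hypothetical helper name, not an existing tool:

```shell
# Hypothetical helper: succeed if the exact parameter $2 appears as a
# space-separated word in the kernel command line string $1.
cmdline_has_param() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *)        return 1 ;;
    esac
}
```

For example: `cmdline_has_param "$(cat /proc/cmdline)" nouveau.config=NvForcePost=1 && echo "NvForcePost is set"`. Matching whole words avoids a false positive on a longer value such as `NvForcePost=10`.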

Comment 4 Ilia Mirkin 2015-10-22 07:05:46 UTC

Please check with kernel 4.1 or later -- some maxwell init issues were hopefully addressed there.

Comment 5 Omar 2015-10-22 11:24:54 UTC

Will the official kernel release do or do you want me to compile the Nouveau GIT kernel?

Comment 6 Ilia Mirkin 2015-10-22 16:11:21 UTC

(In reply to Omar from comment #5)
> Will the official kernel release do or do you want me to compile the Nouveau
> GIT kernel?

Any official kernel (4.1 or later) will do.

Comment 7 Omar 2015-10-22 21:42:22 UTC

I can safely drop nouveau.config=NvForcePost=1 and KMS works on at least 1 out of 2 cards.
I'm also able to run GDM and start a Gnome session on both Xorg and Wayland (albeit the latter lagging quite badly).

It still has issues with the second card though (which ironically is the first card when speaking in terms of PCI slots; 01:00.0).
It seems as if that card is still not initialized and the monitors just show a funky coloured bar made up of small vertical stripes (see attachment).

I did a quick grep for nouveau on the boot journal and attached it. If you want more info I'll be happy to try and provide :)

Comment 8 Omar 2015-10-22 21:42:52 UTC

Created attachment 119115 [details]
Journal nouveau grep

Comment 9 Ilia Mirkin 2015-10-22 21:44:38 UTC

(In reply to Omar from comment #7)
> I can safely drop nouveau.config=NvForcePost=1 and KMS works on at least 1
> out of 2 cards.
> 
> It still has issues with the second card though (which ironically is the
> first card when speaking in terms of PCI slots; 01:00.0).

Which kernel did you try this with? Does using nouveau.config=NvForcePost=1 allow both GPUs to initialize properly?

Comment 10 Omar 2015-10-22 21:47:28 UTC

Created attachment 119116 [details]
Photo of the colored stripes bar

Comment 11 Omar 2015-10-22 21:48:11 UTC

Linux Omar-PC 4.2.3-1-ARCH #1 SMP PREEMPT Sat Oct 3 18:52:50 CEST 2015 x86_64 GNU/Linux

I'll give it a go. I'll update you in a bit :)

Comment 12 Omar 2015-10-22 22:09:31 UTC

They're staying black as opposed to showing a coloured bar but I'm still unable to get anything to display on them.

Comment 13 Omar 2015-10-22 22:09:53 UTC

Created attachment 119118 [details]
Journal nouveau grep, NvForcePost enabled

Comment 14 Ehsan Azar 2016-06-15 14:46:01 UTC

I had the exact same issue with GM107:

```
ehsan@machine:~$ lspci -nn | grep -i vga
03:00.0 VGA compatible controller [0300]: NVIDIA Corporation GM107GL [Quadro K620] [10de:13bb] (rev a2)
```

Adding `nouveau.config=NvForcePost=1` to the kernel command line fixed it for me too, thanks Omar. This is `Ubuntu 16.04 LTS` with kernel version `4.4.0-24-generic #43-Ubuntu SMP Wed Jun 8 19:27:37 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux`

Even the colored striped bars look similar, and I hear a noise I think might have to do with incorrect frequency setting for the monitor.

Comment 15 Omar 2016-12-09 20:26:52 UTC

I thought I'd give this another shot to see if things have changed since the last few kernel releases.

Tested on:
Linux Omar-PC 4.8.12-3-ARCH #1 SMP PREEMPT Thu Dec 8 16:10:23 CET 2016 x86_64 GNU/Linux

Nouveau default behaviour:
- Main GPU shortly blinks with boot information on the first screen followed by the second screen, then both power down.
- The monitor attached to the second GPU shows the coloured stripes.
- Turning the monitor on the second GPU off and back on again made it render a bright blue opaque screen.

Nouveau with config=NvForcePost=1 behaviour:
- All monitors power down, nothing is shown at all on either GPU.

I'll attach the output of `journalctl -b --no-pager | grep nouveau` for both runs shortly.
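
With two GPUs in the machine, the grep output mixes messages from both cards. A sketch of a per-card filter follows. It assumes each nouveau log line contains the card's PCI address literally (the exact line format varies between kernel versions), and `nouveau_lines_for` is a hypothetical name:

```shell
# Hypothetical filter: read log lines on stdin and keep only nouveau lines
# that mention the PCI address given as $1 (fixed-string match, no regex).
nouveau_lines_for() {
    grep -F nouveau | grep -F "$1"
}
```

For example: `journalctl -b --no-pager | nouveau_lines_for 0000:01:00.0 > gpu-01.log`.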

Comment 16 Omar 2016-12-09 20:27:37 UTC

Created attachment 128399 [details]
Kernel 4.8.12, nouveau defaults

Comment 17 Omar 2016-12-09 20:28:01 UTC

Created attachment 128400 [details]
Kernel 4.8.12, nouveau NvForcePost

Comment 18 Jonas Nordlund 2017-04-10 17:53:55 UTC

Is this perhaps bug 94990?

I also own an NVIDIA GTX 970 (Asus) and I can see the Gigabyte G1 Gaming GTX970 also has 4 GB VRAM. The problem here could be that NVIDIA made a controversial decision in making 3.5 GB VRAM full speed, but a special, subpar configuration for the remaining 0.5 GB making it approximately 7x slower than the "main" VRAM.

And nouveau doesn't support this.

There are some tries with a hack in the bug above to have nouveau only use the initial 3.5 GB but I don't think it was successful yet as of April 2017.

It's too bad because it cripples the GTX 970 on most Linux distros. The only remaining option is to add nomodeset to the Linux boot options to get an initial low-resolution, unaccelerated boot, then blacklist the nouveau driver and finally install the proprietary NVIDIA driver. After a reboot it will (should) work, with the caveat that caution is necessary with proprietary drivers and kernel updates, especially on so-called "rolling release" distros.

I'm still having my hopes up for a fix! Would be the single most useful fix for the past two years for me.

Comment 19 Ilia Mirkin 2017-04-10 18:10:15 UTC

(In reply to Jonas Nordlund from comment #18)
> Is this perhaps bug 94990?

If it is, should be solved by using drm-next.

> It's too bad because it cripples the GTX 970 on most Linux distros, leaving
> one with the remaining option to add the nomodeset option to the Linux boot

You could also boot with

nouveau.config=gr=0,sec=0

which should disable the graph unit initialization that is causing the trouble. (Might be secboot and not sec.) Note, I haven't tested this, going from memory on how this works.
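
Several options are passed as one comma-separated `nouveau.config=` argument, as in the line above. A small sketch for assembling that argument; `nouveau_config` is a hypothetical helper, and it does not check whether the option names are valid for a given kernel:

```shell
# Hypothetical helper: join key=value options into one nouveau.config= argument.
nouveau_config() {
    joined=""
    for opt in "$@"; do
        joined="${joined:+$joined,}$opt"   # add a comma only between options
    done
    printf 'nouveau.config=%s\n' "$joined"
}
```

`nouveau_config gr=0 sec=0` prints `nouveau.config=gr=0,sec=0`, which can then be appended to the kernel command line.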

Comment 20 Omar 2017-04-28 19:25:52 UTC

I suspect these are not the same problem tbh.
Anyway, I've tried to run nouveau once again. I've tried the default kernel, the nouveau.config arguments on the default kernel, and using the linux-4.12 branch of github/skeggsb/linux.git.

Default kernel (Linux Omar-PC 4.10.13-1-ARCH #1 SMP PREEMPT Thu Apr 27 12:15:09 CEST 2017 x86_64 GNU/Linux):
- Main GPU blinks with boot information on the first monitor. It then loses signal and the second monitor turns on showing the top and bottom of the boot information that was on screen, everything else is distorted and it shows the coloured stripes along the top (it does so in a higher resolution than the monitor on the second GPU).
- Turning the first monitor off and back on makes the second monitor lose signal and the contents displayed are now shown on the first monitor. Turning the second off and on after that flips things around again. It seems the last monitor to be turned on gets to display the (distorted) output.
- The monitor attached to the second GPU shows the coloured stripes. After turning it off and back on again, it only shows black (it does have a signal).

Default kernel with module arguments gr+sec/secboot:
- Both monitors on the primary GPU lose signal. Turning them off and on changes nothing.
- Monitor on the secondary GPU is black (does get a signal). After turning it off and back on again, it loses signal.

linux-4.12 branch of github/skeggsb/linux.git:
- Same as above with the module arguments, except the monitor on the second GPU does not lose signal after turning it off and on again.
I am assuming this is still the full kernel with the latest nouveau kernel module as stated on the wiki? Otherwise I guess this test was pretty meaningless.

If you want certain logs of any/all of the runs, please drop a message with what you need and I'll do another run to get the requested information. I'm also occasionally on the IRC so you can also reach me there :)
I did see some errors regarding "link training failed" (these have always been there for me every time iirc) and "DRM EVO timeout" (I believe these are new to me. They do not ring a bell).
If I have some time this weekend I'll give things a shot when unplugging the second GPU to see if this changes any behaviour with the 4.10 kernel or not.

Comment 21 Omar 2017-04-30 20:34:43 UTC

I've just done the runs with the first GPU installed only. It's showing the exact same behaviour as before.

Comment 22 Mikołaj Świątek 2017-05-20 16:46:43 UTC

I think I'm suffering from the same bug, though I've only been able to experience it after the VRAM detection problem was fixed in 4.12.

I have an MSI GTX 970 4GB feeding two 1080p monitors - one via HDMI and one via DP.
With kernel 4.12-rc1, and a single monitor connected, everything works fine, on both monitors individually. However, when I try to boot with both connected at the same time, I get corrupted output similar to what Omar described - a coloured stripe at the top of the screen, and what looks like garbled text below it. Output only appears on one monitor, second stays in standby mode. This happens irrespective of whether I boot straight into X or single-user mode. None of the nouveau.config options mentioned do anything different than what Omar reported.

I have some logs captured with nouveau.debug=debug, and can upload them if needed.

Comment 23 Ilia Mirkin 2017-05-20 16:55:01 UTC

Have a look at bug #100676. Is it the same issue? (Please test the patches that were provided there.)

Comment 24 Mikołaj Świątek 2017-05-22 18:47:39 UTC

(In reply to Ilia Mirkin from comment #23)
> Have a look at bug #100676. Is it the same issue? (Please test the patches
> that were provided there.)

I don't *think* it's the exact same issue. The screen photo in that report is similar to what I see, though my text output is completely unintelligible, rather than just slightly corrupted. The reason I thought it was this bug was that the log messages about unknown connectors and failing to create encoders are identical with Omar's in my case.

In any event, I've built the kernel module from Ben's tree and it didn't change anything at all, same behaviour.

Comment 25 Ben Skeggs 2017-05-22 22:18:38 UTC

(In reply to Mikołaj Świątek from comment #24)
> (In reply to Ilia Mirkin from comment #23)
> > Have a look at bug #100676. Is it the same issue? (Please test the patches
> > that were provided there.)
> 
> I don't *think* it's the exact same issue. The screen photo in that report
> is similar to what I see, though my text output is completely
> unintelligible, rather than just slightly corrupted. The reason I thought it
> was this bug was that the log messages about unknown connectors and failing
> to create encoders are identical with Omar's in my case.
> 
> In any event, I've built the kernel module from Ben's tree and it didn't
> change anything at all, same behaviour.

Can I see your kernel log output from that please?  Bonus points if you boot with "log_buf_len=8M nouveau.debug=trace".

Comment 26 Mikołaj Świątek 2017-05-25 09:46:28 UTC

Created attachment 131502 [details]
Kernel module built from master, nouveau.debug=debug

nouveau.debug=trace resulted in too much spam for journald to handle...

Comment 27 Mikołaj Świątek 2017-05-25 09:51:18 UTC

(In reply to Ben Skeggs from comment #25)
> Can I see your kernel log output from that please?  Bonus points if you boot
> with "log_buf_len=8M nouveau.debug=trace".

Tried to do it that way, but it somehow resulted in so much output that I couldn't even see the boot log with journalctl. I guess the kernel ring buffer fills up and the beginning gets overwritten before journald can read it? Not an expert at debugging kernel modules by any stretch, so let me know if I'm missing something obvious here.

In the meantime, uploaded a run with nouveau.debug=debug.
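
One rough way to tell whether the start of the kernel log was overwritten is to look for the "Linux version" banner that the kernel prints at the very beginning of boot. A sketch, with `log_has_boot_start` as a hypothetical helper name:

```shell
# Hypothetical check: succeed if the log on stdin still contains the kernel's
# "Linux version" banner, i.e. the start of the boot log was not lost.
log_has_boot_start() {
    grep -q 'Linux version'
}
```

For example: `journalctl -k -b --no-pager | log_has_boot_start || echo "start of kernel log was lost"`. If the banner is missing, a larger `log_buf_len` or a narrower `nouveau.debug` setting is probably needed.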

Comment 28 Ben Skeggs 2017-05-26 02:51:00 UTC

(In reply to Mikołaj Świątek from comment #27)
> Tried to do it that way, but it somehow resulted in so much output that I
> couldn't even see the boot log with journalctl. I guess the kernel ring
> buffer fills up and  the beginning gets overwritten before journald can read
> it? Not an expert at debugging kernel modules by any stretch, so let me know
> if I'm missing something obvious here.
> 
> In the meantime, uploaded a run with nouveau.debug=debug.

I can't tell 100% for sure from that, but, there's *strong* evidence there to suggest that yes, you are indeed seeing the bug Ilia mentioned.  You can probably work around it in the meantime by plugging one of your displays into another connector, or by trying the tree suggested in the other bug.

Comment 29 Mikołaj Świątek 2017-05-26 15:14:36 UTC

(In reply to Ben Skeggs from comment #28)
> I can't tell 100% for sure from that, but, there's *strong* evidence there
> to suggest that yes, you are indeed seeing the bug Ilia mentioned.  You can
> probably work around it in the meantime by plugging one of your displays
> into another connector, or by trying the tree suggested in the other bug.

Well, for bug 100676 you suggest using your master branch, which I'm already doing. Still, based on that report, I tried booting with "log_buf_len=8M drm.debug=0x14 nouveau.debug=disp=trace,i2c=trace,bios=trace", which gave legible output. I'm attaching the log in the hope that it helps.
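
The per-subsystem form of `nouveau.debug` used above keeps the log volume manageable compared with a global `nouveau.debug=trace`. A sketch for building such an argument; `nouveau_debug` is a hypothetical helper, and the subsystem names are simply the ones quoted in this comment:

```shell
# Hypothetical helper: the first argument is the log level, the remaining
# arguments are subsystem names; prints one nouveau.debug= argument.
nouveau_debug() {
    level=$1
    shift
    joined=""
    for subsys in "$@"; do
        joined="${joined:+$joined,}$subsys=$level"
    done
    printf 'nouveau.debug=%s\n' "$joined"
}
```

`nouveau_debug trace disp i2c bios` prints `nouveau.debug=disp=trace,i2c=trace,bios=trace`.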

Can't really use a different connector for reasons, so for the time being I'm stuck using nvidia's driver.

Comment 30 Mikołaj Świątek 2017-05-26 15:19:26 UTC

Created attachment 131522 [details]
nouveau master, trace

Module built from bskeggs/nouveau master, booted with "log_buf_len=8M drm.debug=0x14 nouveau.debug=disp=trace,i2c=trace,bios=trace".

Comment 31 Ben Skeggs 2017-05-27 01:34:03 UTC

(In reply to Mikołaj Świątek from comment #30)
> Created attachment 131522 [details]
> nouveau master, trace
> 
> Module built from bskeggs/nouveau master, booted with "log_buf_len=8M
> drm.debug=0x14 nouveau.debug=disp=trace,i2c=trace,bios=trace".

According to the log messages here, this output isn't from the code on my master branch.  Perhaps another version of the module got loaded instead?

Comment 32 Mikołaj Świątek 2017-05-27 14:26:32 UTC

(In reply to Ben Skeggs from comment #31)
> (In reply to Mikołaj Świątek from comment #30)
> > Created attachment 131522 [details]
> > nouveau master, trace
> > 
> > Module built from bskeggs/nouveau master, booted with "log_buf_len=8M
> > drm.debug=0x14 nouveau.debug=disp=trace,i2c=trace,bios=trace".
> 
> According to the log messages here, this output isn't from the code on my
> master branch.  Perhaps another version of the module got loaded instead?

Yep, apparently I forgot to regenerate initramfs. Sorry for the confusion, seems that you were right and it was actually bug 100676.
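
A stale module in the initramfs can be spotted by comparing the loaded module with the one on disk. The sketch below assumes the kernel exposes `/sys/module/nouveau/srcversion` and that the module file carries a `srcversion` field, which depends on how the module was built. The function names are hypothetical:

```shell
# Hypothetical helper: succeed only if both values are non-empty and equal.
same_srcversion() {
    [ -n "$1" ] && [ "$1" = "$2" ]
}

# Compare the running nouveau module with the file that modinfo finds on disk.
check_loaded_nouveau() {
    loaded=$(cat /sys/module/nouveau/srcversion 2>/dev/null)
    ondisk=$(modinfo -F srcversion nouveau 2>/dev/null)
    if same_srcversion "$loaded" "$ondisk"; then
        echo "loaded module matches the one on disk"
    else
        echo "mismatch: loaded=$loaded ondisk=$ondisk (stale initramfs?)"
        return 1
    fi
}
```

Running `check_loaded_nouveau` after a reboot gives a quick hint; a mismatch suggests the initramfs still needs to be regenerated.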

With the correct module, both displays work fine in single-user mode, but trying to start an X session results in a kernel BUG in ttm_bo_vm_fault, which I guess should be reported elsewhere.

Comment 33 caguduzexi 2018-01-29 14:10:00 UTC

I won't recommend using/keeping the GM204 (GTX 970). It can't ever run with free software: https://www.theregister.co.uk/2015/04/15/nvidia_gtx_900_linux_driver_roadbloack/
https://www.phoronix.com/scan.php?page=news_item&px=Nouveau-XDC2017
https://www.phoronix.com/scan.php?page=news_item&px=Nouveau-XDC2016-NVIDIA

Sell this crappy GP102 card away and go away from nvidia. Nvidia died with the 780ti card. It's the last end-user card that can be used normally. Everything else is in some countries even a legal problem, because the manufacturer (nvidia) blocks the users from being able to boot the software they want on THEIR hardware - happily illegal in some countries. Hopefully some lawyer would sue the heck out of nvidia so that they would have to release the private signing key or close their doors.
Blocking the freedom of the users in such a way should not be accepted by anyone.

Comment 34 Martin Peres 2018-01-29 17:34:39 UTC

(In reply to caguduzexi from comment #33)
> I won't recommend using/keeping the GM204 (GTX 970). It can't ever run with
> free software:
> https://www.theregister.co.uk/2015/04/15/nvidia_gtx_900_linux_driver_roadbloack/
> https://www.phoronix.com/scan.php?page=news_item&px=Nouveau-XDC2017
> https://www.phoronix.com/scan.php?page=news_item&px=Nouveau-XDC2016-NVIDIA
> 
> Sell this crappy GP102 card away and go away from nvidia. Nvidia died with
> the 780ti card. It's the last end-user card that can be used normally.
> Everything else is in some countries even a legal problem, because the
> manufacturer (nvidia) blocks the users from being able to boot the software
> they want on THEIR hardware - happily illegal in some countries. Hopefully
> some lawyer would sue the heck out of nvidia so that they would have to
> release the private signing key or close their doors.
> Blocking the freedom of the users in such a way should not be accepted by
> anyone.

User banned

Comment 35 Martin Peres 2019-12-04 08:57:51 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/178.
