Bug 97620

Summary: [REGRESSION, bisected] KMS can't initialize GeForce GTX 460 after commit a6a0f67ca7aae2e6bec7ebf55d1e4853dc220816
Product: xorg Reporter: Luke Tidd <lukeisgreat>
Component: Driver/nouveau Assignee: Nouveau Project <nouveau>
Status: RESOLVED MOVED QA Contact: Xorg Project Team <xorg-team>
Severity: critical    
Priority: medium CC: 21Naown, carl.lucas, dimhen, leonard, lukeisgreat, praxy, roucaries.bastien+bugs
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Attachments:
Description | Flags
dmesg_4.5.1-1-ARCH video working flawlessly | none
newer kernel with regression (no video) | none
dmesg_4.6.7-rt11-2-rt (without nomodeset) | none
dmesg_4.6.4-1-ARCH - in case you'd rather not bother with a realtime kernel | none
lspci -v | none
dmesg 4.5.0-rc7 nouveau.debug=debug log_buf_len=1M single | none
4.5.0-rc7 nouveau.debug=debug video=vesafb:off log_buf_len=1M | none
(broken) dmesg_4.7.4-1-ARCH nouveau.debug=debug,bios=trace | none
vbios.rom | none

Description Luke Tidd 2016-09-07 03:39:04 UTC
Created attachment 126259 [details]
dmesg_4.5.1-1-ARCH video working flawlessly

Originally diagnosed in https://bbs.archlinux.org/viewtopic.php?id=216364

distro is Arch. Music studio machine with an nvidia GeForce GTX 460 and "ACHIEVA Shimian QH270-Lite 27" Wide QHD PC Monitor DVI-D 2560x1440" connected via DVI.

Have been using nouveau without issue since 2012. Kernel update from 4.5.1-1 to (4.6.4-1 or 4.6.7-rt11-2-rt) causes a lower than standard resolution during kernel messages then no video once X attempts to start. Reverting to 4.5.1-1 works perfectly. "nomodeset single" kernel options allow me to get to a terminal.

some package versions that don't change between kernels:
extra/mesa 12.0.2-1 [installed]
extra/xf86-video-nouveau 1.0.12-2 (xorg-drivers xorg) [installed]
Comment 1 Luke Tidd 2016-09-07 03:40:07 UTC
Created attachment 126260 [details]
newer kernel with regression (no video)
Comment 2 Ilia Mirkin 2016-09-07 03:41:59 UTC
Your 4.6 kernel is being booted with 'nomodeset'. That disables the nouveau kernel driver entirely.
Comment 3 Luke Tidd 2016-09-07 03:56:42 UTC
Created attachment 126262 [details]
dmesg_4.6.7-rt11-2-rt (without nomodeset)
Comment 4 Luke Tidd 2016-09-07 03:56:57 UTC
Of course. Sorry about that. PTAL
Comment 5 Luke Tidd 2016-09-07 04:07:12 UTC
Created attachment 126266 [details]
dmesg_4.6.4-1-ARCH - in case you'd rather not bother with a realtime kernel
Comment 6 Luke Tidd 2016-09-07 04:09:45 UTC
Created attachment 126267 [details]
lspci -v
Comment 7 Ilia Mirkin 2016-09-07 04:20:41 UTC
Hrmph. Well it looks like your GPU just goes nuts.

[    9.933619] nouveau 0000:01:00.0: bus: MMIO write of 00000002 FAULT at 13b154 [ IBUS ]
[    9.936957] nouveau 0000:01:00.0: bus: MMIO write of 00000000 FAULT at 61019c [ IBUS ]

followed by a ton more 610xxx errors. 610xxx is PDISPLAY stuff, but 13b154 is in PXBAR.GPC_*. I also can't find any way that we'd be writing to that register, at least nothing too direct. Something we do greatly upsets the chip though.
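
To pull the faulting offsets out of a log like this one, a minimal sketch follows. It assumes the message format shown above and labels only the 610xxx range as PDISPLAY, as stated in this comment. The function names are hypothetical, and no other register ranges are assumed.

```shell
# Hypothetical helpers; the names are not from nouveau or envytools.

# Read dmesg text on stdin and print the offset of every
# "MMIO write of <value> FAULT at <offset>" line, one per line.
mmio_fault_addrs() {
    sed -n 's/.*MMIO write of [0-9a-f]* FAULT at \([0-9a-f]*\).*/\1/p'
}

# Label a hex offset (no 0x prefix). Only the 610xxx range is
# classified, as PDISPLAY; anything else is printed as "other".
mmio_block() {
    addr=$((0x$1))
    if [ "$addr" -ge $((0x610000)) ] && [ "$addr" -le $((0x610fff)) ]; then
        echo PDISPLAY
    else
        echo other
    fi
}
```

Something like `dmesg | mmio_fault_addrs | sort | uniq -c` would then give a count per faulting offset.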

I looked through the nouveau changes between 4.5 and 4.6, and while there was a ton of stuff, nothing jumps out at me.

Can you do a git bisect on the kernel between v4.5 and v4.6 (you can restrict it to drivers/gpu/drm/nouveau) to figure out which change killed it?

Separately, you could see if it magically got fixed in v4.7 or v4.8-rcN.
Comment 8 Luke Tidd 2016-09-07 17:05:05 UTC
FYI a newer kernel (linux-git 4.8rc4.r0.g3eab887-1) didn't help. I also knocked my machine unreachable by trying to `git bisect` without specifying the limiting path. I'll test more when I get home from work and can power cycle the machine.
Comment 9 Luke Tidd 2016-09-18 23:19:46 UTC
Hi! Sorry that took so long. My other machine didn't have enough memory to run git bisect on the kernel.


Here's the goods. Any change marked bad exhibited the exact same symptoms (monitor turns off shortly after boot).

git bisect start '--' 'drivers/gpu/drm/nouveau'
# good: [b562e44f507e863c6792946e4e1b1449fbbac85d] Linux 4.5
git bisect good b562e44f507e863c6792946e4e1b1449fbbac85d
# bad: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7
git bisect bad 523d939ef98fd712632d93a5a2b588e477a7565e
# bad: [3a91b9c5efd27729767edfde9df069aa61c4816f] drm/nouveau/clk/gk20a: fix VCO bit mask
git bisect bad 3a91b9c5efd27729767edfde9df069aa61c4816f
# bad: [a2e435a1b0a3c2bc766d40356151610cc54b8772] drm/nouveau/fifo/gk104: take runlist target into account
git bisect bad a2e435a1b0a3c2bc766d40356151610cc54b8772
# bad: [18cd5bc8ea587dc2fc0c07d2a4bf3cfe9ed2ef53] drm/nouveau/gr/gf100: load firmware in outer function
git bisect bad 18cd5bc8ea587dc2fc0c07d2a4bf3cfe9ed2ef53
# good: [1b82111faebc24427c76b83738566bda7f315225] drm/nouveau/device/tegra: fix uninitialized IRQ number
git bisect good 1b82111faebc24427c76b83738566bda7f315225
# bad: [989f57847396d1d042204747985d6aacf5399c8a] drm/nouveau/bios/devinit: rename INIT_DP_CONDITION to INIT_GENERIC_CONDITION
git bisect bad 989f57847396d1d042204747985d6aacf5399c8a
# good: [8fb1240a7152d450d57402b5b85ba46d8610d443] drm/nouveau/devinit/nv50: remove unneeded variable
git bisect good 8fb1240a7152d450d57402b5b85ba46d8610d443
# bad: [96aedd0ba9122a13fd0f756e022856ce7f05f086] drm/nouveau/ltc/gm107: fix slice intr offset
git bisect bad 96aedd0ba9122a13fd0f756e022856ce7f05f086
Comment 10 Ilia Mirkin 2016-09-18 23:21:30 UTC
(In reply to Luke Tidd from comment #9)
> Hi! Sorry that took so long. My other machine didn't have enough memory to
> run git bisect on the kernel.
> 
> 
> Here's the goods. Any change marked bad exhibited the exact same symptoms
> (monitor turns off shortly after boot).

Looks like you didn't finish? At the end it should say "commit XYZ is the first bad commit"
Comment 11 Luke Tidd 2016-09-18 23:55:56 UTC
You are right, sorry, I'm new to git bisect. I was thrown off by "0 more steps to do".

git bisect start '--' 'drivers/gpu/drm/nouveau'
# good: [b562e44f507e863c6792946e4e1b1449fbbac85d] Linux 4.5
git bisect good b562e44f507e863c6792946e4e1b1449fbbac85d
# bad: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7
git bisect bad 523d939ef98fd712632d93a5a2b588e477a7565e
# bad: [3a91b9c5efd27729767edfde9df069aa61c4816f] drm/nouveau/clk/gk20a: fix VCO bit mask
git bisect bad 3a91b9c5efd27729767edfde9df069aa61c4816f
# bad: [a2e435a1b0a3c2bc766d40356151610cc54b8772] drm/nouveau/fifo/gk104: take runlist target into account
git bisect bad a2e435a1b0a3c2bc766d40356151610cc54b8772
# bad: [18cd5bc8ea587dc2fc0c07d2a4bf3cfe9ed2ef53] drm/nouveau/gr/gf100: load firmware in outer function
git bisect bad 18cd5bc8ea587dc2fc0c07d2a4bf3cfe9ed2ef53
# good: [1b82111faebc24427c76b83738566bda7f315225] drm/nouveau/device/tegra: fix uninitialized IRQ number
git bisect good 1b82111faebc24427c76b83738566bda7f315225
# bad: [989f57847396d1d042204747985d6aacf5399c8a] drm/nouveau/bios/devinit: rename INIT_DP_CONDITION to INIT_GENERIC_CONDITION
git bisect bad 989f57847396d1d042204747985d6aacf5399c8a
# good: [8fb1240a7152d450d57402b5b85ba46d8610d443] drm/nouveau/devinit/nv50: remove unneeded variable
git bisect good 8fb1240a7152d450d57402b5b85ba46d8610d443
# bad: [96aedd0ba9122a13fd0f756e022856ce7f05f086] drm/nouveau/ltc/gm107: fix slice intr offset
git bisect bad 96aedd0ba9122a13fd0f756e022856ce7f05f086
# bad: [a6a0f67ca7aae2e6bec7ebf55d1e4853dc220816] drm/nouveau/devinit/gf100-: detect if BIOS invoked devinit
git bisect bad a6a0f67ca7aae2e6bec7ebf55d1e4853dc220816
# first bad commit: [a6a0f67ca7aae2e6bec7ebf55d1e4853dc220816] drm/nouveau/devinit/gf100-: detect if BIOS invoked devinit
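
A finished bisect can be told apart from an unfinished one by the `# first bad commit:` line at the end of the log, as in the log above. A minimal sketch of that check, assuming the log format shown here; the function name is hypothetical:

```shell
# Hypothetical helper: read `git bisect log` output on stdin and print the
# hash from the "# first bad commit:" line. Prints nothing and returns
# non-zero if the log has no such line, i.e. the bisect is unfinished.
bisect_first_bad() {
    sed -n 's/^# first bad commit: \[\([0-9a-f]*\)\].*/\1/p' | grep .
}
```

It would be used as `git bisect log | bisect_first_bad`.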
Comment 12 Ilia Mirkin 2016-09-19 03:01:59 UTC
(In reply to Luke Tidd from comment #11)
> # first bad commit: [a6a0f67ca7aae2e6bec7ebf55d1e4853dc220816]
> drm/nouveau/devinit/gf100-: detect if BIOS invoked devinit

That's interesting. It either used to run init tables and now doesn't, or conversely didn't use to and now does.

Would you mind running a "working" kernel with nouveau.debug=debug and see if it ever says "bios: running init tables"? If it does, you can probably make the new kernels work by adding nouveau.config=NvForcePost=1 . If it previously didn't run the tables and now does, there's no config-line way of overriding it.
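
A minimal sketch of that check against a saved or live dmesg, assuming a boot with nouveau.debug=debug. The function name is hypothetical. It matches only the common suffix of the message, since a later log in this report prints it with a "devinit:" prefix rather than "bios:".

```shell
# Hypothetical helper: succeed if the dmesg text on stdin shows that
# nouveau ran the VBIOS init tables. Matching only the suffix covers both
# "bios: running init tables" and "devinit: running init tables".
ran_init_tables() {
    grep -q 'running init tables'
}
```

For example, `dmesg | ran_init_tables && echo "init tables were run"`.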
Comment 13 Ilia Mirkin 2016-09-19 03:08:24 UTC
Another thing you could do is boot (any) kernel with nomodeset, and then using envytools (https://github.com/envytools/envytools/), do

nvapeek 2240c

This should output 1 line, either a number or just "..." [which means "0"]. I would like to know which it is. Ideally this would be done on a cold boot.
Comment 14 Luke Tidd 2016-09-19 03:26:11 UTC
kernel 4.7.4-1-ARCH
nomodeset single

# nvapeek 2240c
...


testing working kernel now.
Comment 15 Ilia Mirkin 2016-09-19 03:33:28 UTC
(In reply to Luke Tidd from comment #14)
> kernel 4.7.4-1-ARCH
> nomodeset single
> 
> # nvapeek 2240c
> ...
> 
> 
> testing working kernel now.

Hm, that means that the bit we're checking is 0, which indicates that the VBIOS hasn't run. This is, of course, patently false, since it seems like this is the only video card in the system, and it's working (right? you're at the console, the screen is on, etc?)

And, I'm guessing, in the new kernel we're now running the vbios whereas we previously didn't use to. And running it a second time destroys the universe for some reason.
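
The check described here can be sketched as follows. It assumes the bit in question is bit 1 (mask 0x2) of register 0x2240c, which is the bit set by the VBIOS write quoted in comment 27, and that "..." stands for zero as described in comment 13. The function name is hypothetical, and it takes the register value as an argument rather than parsing nvapeek output, whose exact format is not shown in this report.

```shell
# Hypothetical helper. Argument: the value read from 0x2240c as hex
# without a 0x prefix, or "..." (treated as zero).
# Succeeds if bit 1 (mask 0x2) is set, i.e. the VBIOS appears to have run.
vbios_flag_set() {
    case $1 in
        '...') val=0 ;;
        *)     val=$((0x$1)) ;;
    esac
    [ $((val & 2)) -ne 0 ]
}
```

With the "..." result from comment 14, `vbios_flag_set '...'` fails, which under this assumption is what makes the driver decide to run the init tables itself.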

Separately, I just happened to notice that you have vesafb turned on. I wonder if this somehow upsets matters. Any chance you could flip that off?
Comment 16 Luke Tidd 2016-09-19 03:38:55 UTC
(In reply to Ilia Mirkin from comment #15)
> (In reply to Luke Tidd from comment #14)
> > kernel 4.7.4-1-ARCH
> > nomodeset single
> > 
> > # nvapeek 2240c
> > ...
> > 
> > 
> > testing working kernel now.
> 
> Hm, that means that the bit we're checking is 0, which indicates that the
> VBIOS hasn't run. This is, of course, patently false, since it seems like
> this is the only video card in the system, and it's working (right? you're
> at the console, the screen is on, etc?)
Yes, working, screen is on, this is the only video card.

> 
> And, I'm guessing, in the new kernel we're now running the vbios whereas we
> previously didn't use to. And running it a second time destroys the universe
> for some reason.
> 
> Separately, I just happened to notice that you have vesafb turned on. I
> wonder if this somehow upsets matters. Any chance you could flip that off?

Sure, just a sec
Comment 17 Luke Tidd 2016-09-19 03:40:27 UTC
Created attachment 126609 [details]
dmesg 4.5.0-rc7 nouveau.debug=debug log_buf_len=1M single
Comment 18 Luke Tidd 2016-09-19 03:43:46 UTC
(In reply to Ilia Mirkin from comment #12)
> (In reply to Luke Tidd from comment #11)
> > # first bad commit: [a6a0f67ca7aae2e6bec7ebf55d1e4853dc220816]
> > drm/nouveau/devinit/gf100-: detect if BIOS invoked devinit
> 
> That's interesting. It either used to run init tables and now doesn't, or
> conversely didn't use to and now does.
> 
> Would you mind running a "working" kernel with nouveau.debug=debug and see
> if it ever says "bios: running init tables"? If it does, you can probably
> make the new kernels work by adding nouveau.config=NvForcePost=1 . If it
> previously didn't run the tables and now does, there's no config-line way of
> overriding it.


https://bugs.freedesktop.org/attachment.cgi?id=126609
I don't see "bios: running init tables" but here is the entire dmesg.
Comment 19 Luke Tidd 2016-09-19 03:46:07 UTC
Created attachment 126610 [details]
4.5.0-rc7 nouveau.debug=debug video=vesafb:off log_buf_len=1M
Comment 20 Ilia Mirkin 2016-09-19 03:47:31 UTC
(In reply to Luke Tidd from comment #18)
> https://bugs.freedesktop.org/attachment.cgi?id=126609
> I don't see "bios: running init tables" but here is the entire dmesg.

Indeed. So the "working" state is to not run the VBIOS. Something in there is upsetting matters (normally running the VBIOS a second time doesn't hurt anything, it's just pointless work and can cause a flicker). Assuming that disabling vesafb doesn't help (the way those legacy interfaces are implemented can be pretty brittle), please include

(a) /sys/kernel/debug/dri/0/vbios.rom (from any boot)
(b) nouveau.debug=debug,bios=trace output from a *broken* kernel boot

Ideally that should provide some further things to look at.
Comment 21 Luke Tidd 2016-09-19 04:00:49 UTC
# stat /sys/kernel/debug/dri/0/vbios.rom
  File: '/sys/kernel/debug/dri/0/vbios.rom'
  Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
Device: 7h/7d	Inode: 13648       Links: 1
Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Access: 2016-09-18 23:52:15.396661809 -0400
Modify: 2016-09-18 23:52:15.396661809 -0400
Change: 2016-09-18 23:52:15.396661809 -0400
 Birth: -

dri has /0 /64 and /128 and all vbios.rom in each are empty.
Comment 22 Ilia Mirkin 2016-09-19 04:02:12 UTC
(In reply to Luke Tidd from comment #21)
> # stat /sys/kernel/debug/dri/0/vbios.rom
>   File: '/sys/kernel/debug/dri/0/vbios.rom'
>   Size: 0         	Blocks: 0          IO Block: 4096   regular empty file
> Device: 7h/7d	Inode: 13648       Links: 1
> Access: (0444/-r--r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
> Access: 2016-09-18 23:52:15.396661809 -0400
> Modify: 2016-09-18 23:52:15.396661809 -0400
> Change: 2016-09-18 23:52:15.396661809 -0400
>  Birth: -
> 
> dri has /0 /64 and /128 and all vbios.rom in each are empty.

stat can be deceiving. Try cp or cat. (The vbios.rom files in all 3 should be identical btw.)
Comment 23 Luke Tidd 2016-09-19 04:02:23 UTC
Created attachment 126611 [details]
(broken) dmesg_4.7.4-1-ARCH nouveau.debug=debug,bios=trace
Comment 24 Luke Tidd 2016-09-19 04:07:08 UTC
Created attachment 126612 [details]
vbios.rom
Comment 25 Ilia Mirkin 2016-09-19 04:50:16 UTC
(In reply to Luke Tidd from comment #23)
> Created attachment 126611 [details]
> (broken) dmesg_4.7.4-1-ARCH nouveau.debug=debug,bios=trace

I am very confused by the lack of trace output for the bios execution. That should have worked. It's clearly running the tables...

[    1.960340] nouveau 0000:01:00.0: devinit: running init tables

Secondly, one of your bios init scripts contains the command

0x8209: 7a 54 b1 13 00 02 00 00 00                     ZM_REG   R[0x13b154] = 0x00000002

in init script 3 (starting at 0x8116). This is the cause of

[    4.108908] nouveau 0000:01:00.0: bus: MMIO write of 00000002 FAULT at 13b154 [ IBUS ]

However that is only reported when GT214_DISP is being initialized. Again, odd, but I'm not 100% sure how the init scripts are executed. This might have to be one for Ben...
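
For reference, the byte layout of that ZM_REG entry can be decoded by hand. The sketch below assumes the layout visible in the dump above: opcode 7a, then a 32-bit little-endian register offset, then a 32-bit little-endian value. The function name is hypothetical.

```shell
# Hypothetical helper: decode one ZM_REG entry given as a single string of
# 9 space-separated hex bytes. Returns non-zero for anything else.
decode_zm_reg() {
    set -- $1                      # split the string into one byte per word
    [ "$#" -eq 9 ] && [ "$1" = 7a ] || return 1
    reg=$((0x$5$4$3$2))            # bytes 2..5, least significant byte first
    val=$((0x$9$8$7$6))            # bytes 6..9, least significant byte first
    printf 'ZM_REG   R[0x%06x] = 0x%08x\n' "$reg" "$val"
}
```

Running `decode_zm_reg '7a 54 b1 13 00 02 00 00 00'` reproduces the decoded half of the line above, with register 0x13b154 and value 0x00000002.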
Comment 26 Ilia Mirkin 2016-09-19 05:08:00 UTC
0x13b154 = PXBAR.GPC_UNK1[0x2]+0x54

However GF104 only has 2 GPCs. Odd. At least the MMIO write failure makes sense. I wonder if we're supposed to filter those out. That'd be pretty annoying...

Another curious point is that this VBIOS does not write 0x2240c anywhere. I wonder if that only got standardized later on, not starting with GF100.
Comment 27 Ilia Mirkin 2016-09-19 05:34:29 UTC
Looking at our VBIOS repo, we have a bunch (but not all) of GF104's that are missing the 0x2240c write as well as one GK107. Every other (GF100+) GPU sets 0x2240c. I guess it's not as reliable as we'd hoped.

This is how I checked:

for i in nv[cdef]?/*/*.rom nv1??/*/*.rom; do echo $i; nvbios $i |& grep 0x02240c; done

And looked for files without a 'NV_REG   R[0x02240c] &= 0xfffffffd |= 0x00000002' command.
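
A variant of the loop above that prints only the files lacking the write could look like the sketch below. It assumes nvbios from envytools is in PATH and only checks whether the dump mentions 0x02240c at all, not the exact NV_REG command. The function name is hypothetical.

```shell
# Hypothetical helper: print each ROM file given as an argument whose
# nvbios dump never mentions 0x02240c.
roms_missing_2240c() {
    for rom in "$@"; do
        nvbios "$rom" 2>&1 | grep -q 0x02240c || echo "$rom"
    done
}
```

It would be called with the same globs, e.g. `roms_missing_2240c nv[cdef]?/*/*.rom nv1??/*/*.rom`.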

Although that in itself shouldn't be a major issue - rerunning the VBIOS shouldn't cause issues...

Looking at a mmiotrace that came with one of those GF104 VBIOSes that are missing a 0x2240c write, it doesn't seem like the VBIOS gets executed there, or at least not that script. It's late and I could have messed up though.
Comment 28 Ilia Mirkin 2016-09-20 04:13:28 UTC
Obviously not the "right" fix, but you could probably get going by replacing

drivers/gpu/drm/nouveau/nvkm/subdev/devinit/gf100.c:    .preinit = gf100_devinit_preinit

with .preinit = nv50_devinit_preinit

which should get you back to the old logic for determining whether to run init scripts or not.
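
As a sketch only, the substitution can be scripted instead of edited by hand. This assumes the assignment is spelled exactly as in the grep output quoted above and has not been checked against any particular kernel tree. The function name is hypothetical.

```shell
# Hypothetical helper: read gf100.c on stdin and write it to stdout with
# the .preinit hook switched from gf100_devinit_preinit to
# nv50_devinit_preinit. All other lines pass through unchanged.
use_nv50_preinit() {
    sed 's/\.preinit = gf100_devinit_preinit/.preinit = nv50_devinit_preinit/'
}
```

From the top of a kernel tree, with `f=drivers/gpu/drm/nouveau/nvkm/subdev/devinit/gf100.c`, one would run `use_nv50_preinit < "$f" > "$f.new" && mv "$f.new" "$f"` and then rebuild.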
Comment 29 roucaries.bastien+bugs 2016-11-06 14:06:37 UTC
Same symptoms with a GTX 970.

The strange thing is that without signed firmware I can get something working, but at a lower resolution, with kernels > 4.5.

With signed firmware it crashes during KMS and the screen goes black. No reboot is possible except with the hardware button.
Comment 30 Pierre Moreau 2016-11-06 14:39:24 UTC
(In reply to roucaries.bastien+bugs from comment #29)
> Same symptoms with a GTX 970.
> 
> The strange thing is that without signed firmware I can get something
> working, but at a lower resolution, with kernels > 4.5.
> 
> With signed firmware it crashes during KMS and the screen goes black. No
> reboot is possible except with the hardware button.

This is most likely https://bugs.freedesktop.org/show_bug.cgi?id=94990
Comment 31 Carl Lucas 2017-01-27 18:04:53 UTC
Hello,

Is there any solution in sight for this? I have this card, and I am affected by this issue. I'm currently using the bandaid suggested by Ilia in Comment #28, but obviously that's less than ideal, and I'd prefer to go back to using my distro's stock kernel.

For reference, this particular card seems to have weird issues, even for mama nVidia: https://devtalk.nvidia.com/default/topic/959156/-quot-rminitadapter-failed-quot-with-370-23-but-367-35-works-fine/

No idea if the issue there is related at all, but it's another example of a hard-to-track-down problem that affected ONLY this card. (I'm clucas84 in that thread, btw - it was really frustrating at the time, because I couldn't get X at all, since nvidia was busted, and nouveau was affected by this issue).

I'm happy to do any debugging work that you want, including replicating any of the debug dumps already done by Luke Tidd if you think they will help, though I am time-constrained to the weekends for that sort of thing.

Otherwise, I think I'll give up on this card and get a new one - it's served me well, but it seems like it's a bit too quirky.

In any case, thank you very much for your time, and everyone's excellent work on the nouveau driver!

Best,
Carl
Comment 32 Luke Tidd 2017-01-27 19:05:00 UTC
If a developer is interested in this card, please contact me. You can have it, I replaced it as I rely on this machine.
Comment 33 Ilia Mirkin 2017-01-28 04:03:48 UTC
Could someone see if https://patchwork.freedesktop.org/patch/135772/ improves the situation?
Comment 34 Carl Lucas 2017-01-28 20:38:59 UTC
(In reply to Ilia Mirkin from comment #33)
> Could someone see if https://patchwork.freedesktop.org/patch/135772/
> improves the situation?

Yes it does - I was able to boot just fine.

I did have to add #include <subdev/vga.h> though, for nvkm_rdvgac().
Comment 35 Ilia Mirkin 2017-01-28 20:51:16 UTC
(In reply to Carl Lucas from comment #34)
> (In reply to Ilia Mirkin from comment #33)
> > Could someone see if https://patchwork.freedesktop.org/patch/135772/
> > improves the situation?
> 
> Yes it does - I was able to boot just fine.
> 
> I did have to add #include <subdev/vga.h> though, for nvkm_rdvgac().

Right, of course, I had that change locally but forgot to stick it into the commit. I'll send a v2 and cc you.
Comment 36 Ben Skeggs 2017-01-31 09:46:57 UTC
Are you able to grab a mmiotrace[1] of the NVIDIA binary driver initialising this GPU for me please?

Ben.

[1] https://nouveau.freedesktop.org/wiki/MmioTrace/
Comment 37 Carl Lucas 2017-01-31 14:50:08 UTC
(In reply to Ben Skeggs from comment #36)
> Are you able to grab a mmiotrace[1] of the NVIDIA binary driver initialising
> this GPU for me please?
> 
> Ben.
> 
> [1] https://nouveau.freedesktop.org/wiki/MmioTrace/

Absolutely - I probably won't be able to get to it until this Saturday (2/4) though. I'll post the results once I've got them.
Comment 38 21Naown 2018-11-04 00:39:21 UTC
Hello,

Do you have some information about the progress of the fix?
Comment 39 Martin Peres 2019-12-04 09:17:21 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/286.
