Summary: | [REGRESSION, bisected] KMS can't initialize GeForce GTX 460 after commit a6a0f67ca7aae2e6bec7ebf55d1e4853dc220816 | ||
---|---|---|---|
Product: | xorg | Reporter: | Luke Tidd <lukeisgreat> |
Component: | Driver/nouveau | Assignee: | Nouveau Project <nouveau> |
Status: | RESOLVED MOVED | QA Contact: | Xorg Project Team <xorg-team> |
Severity: | critical | ||
Priority: | medium | CC: | 21Naown, carl.lucas, dimhen, leonard, lukeisgreat, praxy, roucaries.bastien+bugs |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Description
Luke Tidd
2016-09-07 03:39:04 UTC
Created attachment 126260 [details]
newer kernel with regression (no video)
Your 4.6 kernel is being booted with 'nomodeset'. That disables the nouveau kernel driver entirely.

Created attachment 126262 [details]
dmesg_4.6.7-rt11-2-rt (without nomodeset)

Of course. Sorry about that. PTAL

Created attachment 126266 [details]
dmesg_4.6.4-1-ARCH - in case you'd rather not bother with a realtime kernel

Created attachment 126267 [details]
lspci -v

Hrmph. Well it looks like your GPU just goes nuts.

[ 9.933619] nouveau 0000:01:00.0: bus: MMIO write of 00000002 FAULT at 13b154 [ IBUS ]
[ 9.936957] nouveau 0000:01:00.0: bus: MMIO write of 00000000 FAULT at 61019c [ IBUS ]

followed by a ton more 610xxx errors. 610xxx is PDISPLAY stuff, but 13b154 is in PXBAR.GPC_*. I also can't find any way that we'd be writing to that register, at least nothing too direct. Something we do greatly upsets the chip though.

I looked through the nouveau changes between 4.5 and 4.6, and while there was a ton of stuff, nothing jumps out at me.

Can you do a git bisect on the kernel between v4.5 and v4.6 (you can restrict it to drivers/gpu/drm/nouveau) to figure out which change killed it? Separately, you could see if it magically got fixed in v4.7 or v4.8-rcN.

FYI a newer kernel (linux-git 4.8rc4.r0.g3eab887-1) didn't help. I also knocked my machine unreachable by trying to `git bisect` without specifying the limiting path. I'll test more when I get home from work and can power cycle the machine.

Hi! Sorry that took so long. My other machine didn't have enough memory to run git bisect on the kernel.

Here's the goods. Any change marked bad exhibited the exact same symptoms (monitor turns off shortly after boot).

git bisect start '--' 'drivers/gpu/drm/nouveau'
# good: [b562e44f507e863c6792946e4e1b1449fbbac85d] Linux 4.5
git bisect good b562e44f507e863c6792946e4e1b1449fbbac85d
# bad: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7
git bisect bad 523d939ef98fd712632d93a5a2b588e477a7565e
# bad: [3a91b9c5efd27729767edfde9df069aa61c4816f] drm/nouveau/clk/gk20a: fix VCO bit mask
git bisect bad 3a91b9c5efd27729767edfde9df069aa61c4816f
# bad: [a2e435a1b0a3c2bc766d40356151610cc54b8772] drm/nouveau/fifo/gk104: take runlist target into account
git bisect bad a2e435a1b0a3c2bc766d40356151610cc54b8772
# bad: [18cd5bc8ea587dc2fc0c07d2a4bf3cfe9ed2ef53] drm/nouveau/gr/gf100: load firmware in outer function
git bisect bad 18cd5bc8ea587dc2fc0c07d2a4bf3cfe9ed2ef53
# good: [1b82111faebc24427c76b83738566bda7f315225] drm/nouveau/device/tegra: fix uninitialized IRQ number
git bisect good 1b82111faebc24427c76b83738566bda7f315225
# bad: [989f57847396d1d042204747985d6aacf5399c8a] drm/nouveau/bios/devinit: rename INIT_DP_CONDITION to INIT_GENERIC_CONDITION
git bisect bad 989f57847396d1d042204747985d6aacf5399c8a
# good: [8fb1240a7152d450d57402b5b85ba46d8610d443] drm/nouveau/devinit/nv50: remove unneeded variable
git bisect good 8fb1240a7152d450d57402b5b85ba46d8610d443
# bad: [96aedd0ba9122a13fd0f756e022856ce7f05f086] drm/nouveau/ltc/gm107: fix slice intr offset
git bisect bad 96aedd0ba9122a13fd0f756e022856ce7f05f086

(In reply to Luke Tidd from comment #9)
> Hi! Sorry that took so long. My other machine didn't have enough memory to
> run git bisect on the kernel.
>
>
> Here's the goods. Any change marked bad exhibited the exact same symptoms
> (monitor turns off shortly after boot).

Looks like you didn't finish? At the end it should say "commit XYZ is the first bad commit"

You are right, sorry new to git bisect. Was thrown off by "0 more steps to do".

git bisect start '--' 'drivers/gpu/drm/nouveau'
# good: [b562e44f507e863c6792946e4e1b1449fbbac85d] Linux 4.5
git bisect good b562e44f507e863c6792946e4e1b1449fbbac85d
# bad: [523d939ef98fd712632d93a5a2b588e477a7565e] Linux 4.7
git bisect bad 523d939ef98fd712632d93a5a2b588e477a7565e
# bad: [3a91b9c5efd27729767edfde9df069aa61c4816f] drm/nouveau/clk/gk20a: fix VCO bit mask
git bisect bad 3a91b9c5efd27729767edfde9df069aa61c4816f
# bad: [a2e435a1b0a3c2bc766d40356151610cc54b8772] drm/nouveau/fifo/gk104: take runlist target into account
git bisect bad a2e435a1b0a3c2bc766d40356151610cc54b8772
# bad: [18cd5bc8ea587dc2fc0c07d2a4bf3cfe9ed2ef53] drm/nouveau/gr/gf100: load firmware in outer function
git bisect bad 18cd5bc8ea587dc2fc0c07d2a4bf3cfe9ed2ef53
# good: [1b82111faebc24427c76b83738566bda7f315225] drm/nouveau/device/tegra: fix uninitialized IRQ number
git bisect good 1b82111faebc24427c76b83738566bda7f315225
# bad: [989f57847396d1d042204747985d6aacf5399c8a] drm/nouveau/bios/devinit: rename INIT_DP_CONDITION to INIT_GENERIC_CONDITION
git bisect bad 989f57847396d1d042204747985d6aacf5399c8a
# good: [8fb1240a7152d450d57402b5b85ba46d8610d443] drm/nouveau/devinit/nv50: remove unneeded variable
git bisect good 8fb1240a7152d450d57402b5b85ba46d8610d443
# bad: [96aedd0ba9122a13fd0f756e022856ce7f05f086] drm/nouveau/ltc/gm107: fix slice intr offset
git bisect bad 96aedd0ba9122a13fd0f756e022856ce7f05f086
# bad: [a6a0f67ca7aae2e6bec7ebf55d1e4853dc220816] drm/nouveau/devinit/gf100-: detect if BIOS invoked devinit
git bisect bad a6a0f67ca7aae2e6bec7ebf55d1e4853dc220816
# first bad commit: [a6a0f67ca7aae2e6bec7ebf55d1e4853dc220816] drm/nouveau/devinit/gf100-: detect if BIOS invoked devinit

(In reply to Luke Tidd from comment #11)
> # first bad commit: [a6a0f67ca7aae2e6bec7ebf55d1e4853dc220816]
> drm/nouveau/devinit/gf100-: detect if BIOS invoked devinit

That's interesting. It either used to run init tables and now doesn't, or conversely didn't use to and now does.

Would you mind running a "working" kernel with nouveau.debug=debug and see if it ever says "bios: running init tables"? If it does, you can probably make the new kernels work by adding nouveau.config=NvForcePost=1 . If it previously didn't run the tables and now does, there's no config-line way of overriding it.

Another thing you could do is boot (any) kernel with nomodeset, and then using envytools (https://github.com/envytools/envytools/), do

nvapeek 2240c

This should output 1 line, either a number or just "..." [which means "0"], I would like to know what it is. Ideally this would be done on a cold boot.

kernel 4.7.4-1-ARCH
nomodeset single

# nvapeek 2240c
...

testing working kernel now.

(In reply to Luke Tidd from comment #14)
> kernel 4.7.4-1-ARCH
> nomodeset single
>
> # nvapeek 2240c
> ...
>
>
> testing working kernel now.

Hm, that means that the bit we're checking is 0, which indicates that the VBIOS hasn't run. This is, of course, patently false, since it seems like this is the only video card in the system, and it's working (right? you're at the console, the screen is on, etc?)

And, I'm guessing, in the new kernel we're now running the vbios whereas we previously didn't use to. And running it a second time destroys the universe for some reason.

Separately, I just happened to notice that you have vesafb turned on. I wonder if this somehow upsets matters. Any chance you could flip that off?

(In reply to Ilia Mirkin from comment #15)
> (In reply to Luke Tidd from comment #14)
> > kernel 4.7.4-1-ARCH
> > nomodeset single
> >
> > # nvapeek 2240c
> > ...
> >
> >
> > testing working kernel now.
>
> Hm, that means that the bit we're checking is 0, which indicates that the
> VBIOS hasn't run. This is, of course, patently false, since it seems like
> this is the only video card in the system, and it's working (right? you're
> at the console, the screen is on, etc?)

Yes, working, screen is on, this is the only video card.

>
> And, I'm guessing, in the new kernel we're now running the vbios whereas we
> previously didn't use to. And running it a second time destroys the universe
> for some reason.
>
> Separately, I just happened to notice that you have vesafb turned on. I
> wonder if this somehow upsets matters. Any chance you could flip that off?

Sure, just a sec

Created attachment 126609 [details]
dmesg 4.5.0-rc7 nouveau.debug=debug log_buf_len=1M single
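
The 0x2240c check discussed above can be sketched roughly as follows. This is only an illustration, not the driver's code. It assumes the flag is bit 1 (mask 0x00000002, taken from the 'NV_REG R[0x02240c] &= 0xfffffffd |= 0x00000002' VBIOS command quoted later in this report), that nvapeek prints the value in hex as the last token on the line, and that "..." stands for zero as described in the thread. The helper names are hypothetical.

```python
# Assumption: bit 1 of register 0x2240c is the "devinit has run" flag,
# inferred from the VBIOS command quoted in this report.
DEVINIT_DONE_MASK = 0x00000002


def parse_nvapeek_line(line: str) -> int:
    """Return the register value from one line of nvapeek output.

    Assumes the value is the last whitespace-separated token, printed
    in hex, and that "..." means zero.
    """
    tokens = line.split()
    if not tokens:
        raise ValueError("empty nvapeek output")
    token = tokens[-1]
    if token == "...":
        return 0
    return int(token, 16)


def driver_would_run_devinit(value: int) -> bool:
    """True if the flag bit is clear, i.e. the VBIOS appears not to have run."""
    return (value & DEVINIT_DONE_MASK) == 0


if __name__ == "__main__":
    # The output reported in this thread for the GTX 460 under nomodeset.
    value = parse_nvapeek_line("...")
    print(driver_would_run_devinit(value))  # prints True
```

With the "..." output reported here, the flag reads as clear, which matches the guess that newer kernels decide to run the init tables on this card.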

(In reply to Ilia Mirkin from comment #12)
> (In reply to Luke Tidd from comment #11)
> > # first bad commit: [a6a0f67ca7aae2e6bec7ebf55d1e4853dc220816]
> > drm/nouveau/devinit/gf100-: detect if BIOS invoked devinit
>
> That's interesting. It either used to run init tables and now doesn't, or
> conversely didn't use to and now does.
>
> Would you mind running a "working" kernel with nouveau.debug=debug and see
> if it ever says "bios: running init tables"? If it does, you can probably
> make the new kernels work by adding nouveau.config=NvForcePost=1 . If it
> previously didn't run the tables and now does, there's no config-line way of
> overriding it.

https://bugs.freedesktop.org/attachment.cgi?id=126609
I don't see "bios: running init tables" but here is the entire dmesg.

Created attachment 126610 [details]
4.5.0-rc7 nouveau.debug=debug video=vesafb:off log_buf_len=1M
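
Checking a saved dmesg for the init-table message can be done with a small filter like the one below. It is a sketch under assumptions: the function name is made up, and it matches only on the words "running init tables" in a nouveau line, because the subsystem prefix is spelled "bios:" in the request above but "devinit:" in a log line quoted elsewhere in this report.

```python
from typing import List


def init_table_lines(dmesg_text: str) -> List[str]:
    """Return nouveau log lines that report running the VBIOS init tables.

    The prefix before "running init tables" is deliberately not checked,
    since both "bios:" and "devinit:" appear in this thread.
    """
    return [
        line
        for line in dmesg_text.splitlines()
        if "nouveau" in line and "running init tables" in line
    ]


if __name__ == "__main__":
    sample = (
        "[ 1.960340] nouveau 0000:01:00.0: devinit: running init tables\n"
        "[ 4.108908] nouveau 0000:01:00.0: bus: MMIO write of 00000002 "
        "FAULT at 13b154 [ IBUS ]\n"
    )
    for line in init_table_lines(sample):
        print(line)
```

An empty result on a "working" kernel and a match on a broken one would agree with what was reported in this bug.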

(In reply to Luke Tidd from comment #18)
> https://bugs.freedesktop.org/attachment.cgi?id=126609
> I don't see "bios: running init tables" but here is the entire dmesg.

Indeed. So the "working" state is to not run the VBIOS. Something in there is upsetting matters (normally running the VBIOS a second time doesn't hurt anything, it's just pointless work and can cause a flicker).

Assuming that disabling vesafb doesn't help (the way those legacy interfaces are implemented can be pretty brittle), please include

(a) /sys/kernel/debug/dri/0/vbios.rom (from any boot)
(b) nouveau.debug=debug,bios=trace output from a *broken* kernel boot

Ideally that should provide some further things to look at.

# stat /sys/kernel/debug/dri/0/vbios.rom
File: '/sys/kernel/debug/dri/0/vbios.rom'
Size: 0 Blocks: 0 IO Block: 4096 regular empty file
Device: 7h/7d Inode: 13648 Links: 1
Access: (0444/-r--r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2016-09-18 23:52:15.396661809 -0400
Modify: 2016-09-18 23:52:15.396661809 -0400
Change: 2016-09-18 23:52:15.396661809 -0400
Birth: -

dri has /0 /64 and /128 and all vbios.rom in each are empty.

(In reply to Luke Tidd from comment #21)
> # stat /sys/kernel/debug/dri/0/vbios.rom
> File: '/sys/kernel/debug/dri/0/vbios.rom'
> Size: 0 Blocks: 0 IO Block: 4096 regular empty file
> Device: 7h/7d Inode: 13648 Links: 1
> Access: (0444/-r--r--r--) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2016-09-18 23:52:15.396661809 -0400
> Modify: 2016-09-18 23:52:15.396661809 -0400
> Change: 2016-09-18 23:52:15.396661809 -0400
> Birth: -
>
> dri has /0 /64 and /128 and all vbios.rom in each are empty.

stat can be deceiving. Try cp or cat. (The vbios.rom files in all 3 should be identical btw.)

Created attachment 126611 [details]
(broken) dmesg_4.7.4-1-ARCH nouveau.debug=debug,bios=trace
Created attachment 126612 [details]
vbios.rom

(In reply to Luke Tidd from comment #23)
> Created attachment 126611 [details]
> (broken) dmesg_4.7.4-1-ARCH nouveau.debug=debug,bios=trace

I am very confused by the lack of trace output for the bios execution. That should have worked. It's clearly running the tables...

[ 1.960340] nouveau 0000:01:00.0: devinit: running init tables

Secondly, one of your bios init scripts contains the command

0x8209: 7a 54 b1 13 00 02 00 00 00 ZM_REG R[0x13b154] = 0x00000002

in init script 3 (starting at 0x8116). This is the cause of

[ 4.108908] nouveau 0000:01:00.0: bus: MMIO write of 00000002 FAULT at 13b154 [ IBUS ]

However that is only reported when GT214_DISP is being initialized. Again, odd, but I'm not 100% sure how the init scripts are executed. This might have to be one for Ben...

0x13b154 = PXBAR.GPC_UNK1[0x2]+0x54

However GF104 only has 2 GPC's. Odd. At least the MMIO write failure makes sense. I wonder if we're supposed to filter those out. That'd be pretty annoying...

Another curious point is that this VBIOS does not write 0x2240c anywhere. I wonder if that only got standardized later on, not starting with GF100.

Looking at our VBIOS repo, we have a bunch (but not all) of GF104's that are missing the 0x2240c write as well as one GK107. Every other (GF100+) GPU sets 0x2240c. I guess it's not as reliable as we'd hoped. This is how I checked:

for i in nv[cdef]?/*/*.rom nv1??/*/*.rom; do echo $i; nvbios $i |& grep 0x02240c; done

And looked for files without a 'NV_REG R[0x02240c] &= 0xfffffffd |= 0x00000002' command.

Although that in itself shouldn't be a major issue - rerunning the VBIOS shouldn't cause issues...

Looking at a mmiotrace that came with one of those GF104 VBIOSes that are missing a 0x2240c write, it doesn't seem like the VBIOS gets executed there, or at least not that script. It's late and I could have messed up though.

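The scan described above, finding VBIOS images whose decoded scripts never write 0x2240c, could also be expressed along these lines. This is a hedged sketch that works on text already produced by nvbios; it does not run the tool itself. The function names and the ROM file names in the example are hypothetical, and the match is a plain substring test on "R[0x02240c]", mirroring the grep in the shell loop.

```python
from typing import Dict, List

# Register text as it appears in the decoded command quoted above.
FLAG_REG = "R[0x02240c]"


def writes_flag(decoded_script: str) -> bool:
    """True if any decoded line mentions the 0x02240c register."""
    return any(FLAG_REG in line for line in decoded_script.splitlines())


def roms_missing_flag_write(decoded: Dict[str, str]) -> List[str]:
    """Return the names of ROMs whose decoded output never touches 0x02240c."""
    return sorted(name for name, text in decoded.items() if not writes_flag(text))


if __name__ == "__main__":
    # Hypothetical decoded output for two made-up ROM files.
    decoded = {
        "example-a.rom": "NV_REG R[0x02240c] &= 0xfffffffd |= 0x00000002\n",
        "example-b.rom": "ZM_REG R[0x13b154] = 0x00000002\n",
    }
    print(roms_missing_flag_write(decoded))  # prints ['example-b.rom']
```

Any ROM that shows up in the result would be one where the register-based "has devinit run" test cannot be trusted, which is the situation described for this GTX 460.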

Obviously not the "right" fix, but you could probably get going by replacing

drivers/gpu/drm/nouveau/nvkm/subdev/devinit/gf100.c:

.preinit = gf100_devinit_preinit

with

.preinit = nv50_devinit_preinit

which should get you back to the old logic for determining whether to run init scripts or not.

Same symptoms with a GTX 970.

Strangely, without signed firmware I can get something working, but at a lower resolution, with kernels > 4.5.

With signed firmware it crashes during KMS and the screen goes black. No reboot is possible except with the hardware button.

(In reply to roucaries.bastien+bugs from comment #29)
> Same symptoms with a GTX 970.
>
> Strangely, without signed firmware I can get something working, but at a
> lower resolution, with kernels > 4.5.
>
> With signed firmware it crashes during KMS and the screen goes black. No
> reboot is possible except with the hardware button.

This is most likely https://bugs.freedesktop.org/show_bug.cgi?id=94990

Hello,

Is there any solution in sight for this? I have this card, and I am affected by this issue. I'm currently using the bandaid suggested by Ilia in comment #28, but obviously that's less than ideal, and I'd prefer to go back to using my distro's stock kernel.

For reference, this particular card seems to have weird issues, even for mama nVidia:

https://devtalk.nvidia.com/default/topic/959156/-quot-rminitadapter-failed-quot-with-370-23-but-367-35-works-fine/

No idea if the issue there is related at all, but it's another example of a hard-to-track-down problem that affected ONLY this card. (I'm clucas84 in that thread, btw - it was really frustrating at the time, because I couldn't get X at all, since nvidia was busted, and nouveau was affected by this issue).

I'm happy to do any debugging work that you want, including replicating any of the debug dumps already done by Luke Tidd if you think they will help, though I am time-constrained to the weekends for that sort of thing.

Otherwise, I think I'll give up on this card and get a new one - it's served me well, but it seems like it's a bit too quirky.

In any case, thank you very much for your time, and everyone's excellent work on the nouveau driver!

Best,
Carl

If a developer is interested in this card, please contact me. You can have it, I replaced it as I rely on this machine.

Could someone see if https://patchwork.freedesktop.org/patch/135772/ improves the situation?

(In reply to Ilia Mirkin from comment #33)
> Could someone see if https://patchwork.freedesktop.org/patch/135772/
> improves the situation?

Yes it does - I was able to boot just fine.

I did have to add #include <subdev/vga.h> though, for nvkm_rdvgac().

(In reply to Carl Lucas from comment #34)
> (In reply to Ilia Mirkin from comment #33)
> > Could someone see if https://patchwork.freedesktop.org/patch/135772/
> > improves the situation?
>
> Yes it does - I was able to boot just fine.
>
> I did have to add #include <subdev/vga.h> though, for nvkm_rdvgac().

Right, of course, I had that change locally but forgot to stick it into the commit. I'll send a v2 and cc you.

Are you able to grab a mmiotrace[1] of the NVIDIA binary driver initialising this GPU for me please?

Ben.

[1] https://nouveau.freedesktop.org/wiki/MmioTrace/

(In reply to Ben Skeggs from comment #36)
> Are you able to grab a mmiotrace[1] of the NVIDIA binary driver initialising
> this GPU for me please?
>
> Ben.
>
> [1] https://nouveau.freedesktop.org/wiki/MmioTrace/

Absolutely - I probably won't be able to get to it until this Saturday (2/4) though. I'll post the results once I've got them.

Hello,

Do you have some information about the progress of the fix?

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/286.