Created attachment 77445 [details]
systemd journal with kernel messages
Hardware: NVIDIA Corporation G96 [Quadro FX 580] (rev a1)
Distro: Arch Linux
Bisected commit: http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=2c1a425e7d537ea655fa276f267c0b1cd100ff34
Steps to reproduce:
1. Boot system into text mode with normal kernel parameters.
2. Wait for module to load and attempt KMS.
3. KMS succeeds, login prompt appears.
4. System locks up and monitors go into power-saving mode.
* With nouveau.modeset=0, system boots normally.
* Was able to use SysRq to sync disks after hang but could not ssh, ping, or reboot.
* Regression still present in current nvidia-git (db1b0f52) as well as kernel.org git.
* nvidia module is not installed.
lspci.log and vbios.rom forthcoming.
Created attachment 77446 [details]
Created attachment 77447 [details]
vbios.rom from debugfs
(Sorry - not "nvidia-git" in the description, I meant to type "Regression still present in current nouveau-git.")
Is there any further information or testing that I can provide to help resolve this issue?
"nouveau E[ PBUS][0000:01:00.0] MMIO read of 0xFOO FAULT at 0xBAR"
01:00.0 VGA compatible controller : NVIDIA Corporation ION VGA [10de:087d] (rev b1)
fb: conflicting fb hw usage nouveaufb vs VESA VGA - removing generic driver
nouveau 0000:01:00.0: setting latency timer to 64
nouveau [ DEVICE][0000:01:00.0] BOOT0 : 0x0ace80b1
nouveau [ DEVICE][0000:01:00.0] Chipset: MCP79/MCP7A (NVAC)
nouveau [ DEVICE][0000:01:00.0] Family : NV50
nouveau [ VBIOS][0000:01:00.0] checking PRAMIN for image...
nouveau [ VBIOS][0000:01:00.0] ... appears to be valid
nouveau [ VBIOS][0000:01:00.0] using image from PRAMIN
nouveau [ VBIOS][0000:01:00.0] BIT signature found
nouveau [ VBIOS][0000:01:00.0] version 62.79.78.00.00
nouveau [ PFB][0000:01:00.0] RAM type: stolen system memory
nouveau [ PFB][0000:01:00.0] RAM size: 256 MiB
nouveau [ PFB][0000:01:00.0] ZCOMP: 0 tags
nouveau [ PTHERM][0000:01:00.0] FAN control: none / external
nouveau [ PTHERM][0000:01:00.0] fan management: disabled
nouveau [ PTHERM][0000:01:00.0] internal sensor: yes
nouveau [ PTHERM][0000:01:00.0] Programmed thresholds [ 90(3), 95(3), 95(2), 135(5) ]
nouveau [ DRM] VRAM: 256 MiB
nouveau [ DRM] GART: 512 MiB
nouveau [ DRM] TMDS table version 2.0
nouveau [ DRM] DCB version 4.0
nouveau [ DRM] DCB outp 00: 02000300 0000001e
nouveau [ DRM] DCB outp 01: 01011322 00000030
nouveau [ DRM] DCB outp 02: 02022332 00020010
nouveau [ DRM] DCB conn 00: 00000000
nouveau [ DRM] DCB conn 01: 00001131
nouveau [ DRM] DCB conn 02: 00002261
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00030003 FAULT at 0x100220
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x100228
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x10022c
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x100234
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x100240
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x1002c0
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x1002c4
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x1002e4
nouveau [ DRM] 4 available performance level(s)
nouveau [ DRM] 0: core 200MHz shader 400MHz voltage 1010mV fanspeed 60%
nouveau [ DRM] 1: core 300MHz shader 600MHz voltage 1010mV fanspeed 100%
nouveau [ DRM] 2: core 350MHz shader 800MHz voltage 1010mV fanspeed 100%
nouveau [ DRM] 3: core 450MHz shader 1100MHz voltage 1010mV fanspeed 100%
nouveau [ DRM] c:
nouveau [ DRM] MM: using M2MF for buffer copies
nouveau [ DRM] allocated 800x600 fb: 0x50000, bo ffff8800c81b0800
fbcon: nouveaufb (fb0) is primary device
nouveau 0000:01:00.0: fb0: nouveaufb frame buffer device
[drm] Initialized nouveau 1.1.0 20120801 for 0000:01:00.0 on minor 1
Created attachment 79921 [details]
full syslog with nouveau.debug=debug
Attaching syslog with extra debug info, in case this is more useful than the previous attachment.
(Note to poma: unless you are a Nouveau dev, gratuitously raising the sev and prio fields to the top is likely to get a bug ignored. Setting back.)
Devin, can you confirm that commit 2c1a425e7 really is the first bad one? It adds support for a subdevice/engine that is not present on your system, so that code is unused in your case.
Please build a kernel from the parent commit, 757833cc9, and attach a debug log from it (booted with nouveau.debug=debug).
Well darn, apparently it isn't. Unfortunately the bug doesn't occur 100% of the time when you boot a bad build, so I'm going to have to re-bisect and do a lot of rebooting.
I'll attach a debug log from the last known good build... once I know what that is. :)
Created attachment 80240 [details]
debug log from last good commit
Whew! Re-bisected, and the first commit I've been able to reproduce the bug on is ebb945a, "port all engines to new engine module format".
Attaching a debug log from the last good commit, ac1499d, as requested.
Nooo, why did it have to lead to that commit :(
Devin, I don't suppose you have a debug dmesg closer to the offending commit, do you? The delta between these two is quite substantial.
My initial suspicion is that the stored DP data (or its parsing) is to blame for your issue.
* Appears to be parsed properly?
* unknown opcode 0x44
* nouveau E[ PDISP][0000:03:00.0] DP:0006:0382: bios data not found
s/debug dmesg/failing debug log/
Created attachment 80348 [details] [review]
Parse DP table v20
Additionally, you can try the attached patch. It will enable your DP table (v20) to be parsed the same way as v21 (seen on my nv96 - GT120M).
It should clear the "DP:0006:0382: bios data not found" message.
I'll see what I can do about getting a log from the first buggy commit. Occasionally when I'm lucky it lets me REISUB.
(The obsoleted attachment was from a much closer commit, but was without nouveau.debug=debug, if that helps at all.)
Created attachment 80439 [details]
log from last good rev
Created attachment 80440 [details]
log from first bad rev, non-bugged boot
Created attachment 80441 [details]
log from first bad rev, hanged boot
Have some logs. God bless the serial console.
Attached one full log from the last good commit, and two logs from the first bad commit: one where the system boots normally, and one where it hangs.
[ 11.902160] RIP: 0010:[<ffffffffa0570160>] [<ffffffffa0570160>] drm_fb_helper_hotplug_event+0x10/0x100 [drm_kms_helper]
That seems to imply that drm->fbcon->helper == NULL. But that's not what you're seeing on recent kernels, so the implication is that there's some sort of second concurrency bug in the rev you identified that got fixed later on.... I don't think that your bisect results are valid as a result :(
Was the original problem 100% reproducible? If so, you should mark any kernel that works sometimes as "good" (even if it's not really *that* good).
Also, the original problem is just monitors not working, but you can ssh in/etc, right? Or does the system totally hang? (Doesn't seem that way given that logs showed up in systemd...)
No, the original problem was not 100% reproducible. It always felt (this is only a guess) like it had something to do with how long the KMS switch took... if it took too long, the monitors went into powersave, and the system would almost certainly hang.
There seem to be three possibilities for a boot on a bugged revision:
1. Everything works.
2. Displays go into powersave, system hangs, and you can't ssh or use SysRq.
3. (Rare.) Displays go into powersave but you can still use SysRq.
For the latest attachments, I used a serial console to reliably get logs from cases 1 and 2. The original logs (from a 3.9-series kernel) relied on getting case 3 so I could REISUB and read the logs on next boot, but this may not show the bug properly. Would you like me to get logs from a recent kernel using the serial console so you can compare?
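Since a bugged revision sometimes boots normally (case 1), one clean boot is weak evidence that a commit is good. A rough sketch of the arithmetic follows. It assumes a hypothetical per-boot hang probability `p_hang` (not measured anywhere in this report) and that boots are independent of each other, which may not hold if the hang depends on timing.

```python
import math


def false_good_chance(p_hang, clean_boots):
    """Chance that a bad commit survives `clean_boots` boots without hanging."""
    return (1.0 - p_hang) ** clean_boots


def boots_needed(p_hang, max_false_good):
    """Smallest number of clean boots that pushes the chance of wrongly
    marking a bad commit as good down to `max_false_good` or below."""
    if not 0.0 < p_hang < 1.0:
        raise ValueError("p_hang must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_hang))


if __name__ == "__main__":
    # Hypothetical hang rates, chosen only for illustration.
    for p in (0.5, 0.3, 0.1):
        print(p, boots_needed(p, 0.05))
```

The lower the hang rate, the more reboots each bisect step needs before "good" can be trusted, which is why the bisect takes so many reboots.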
Yes, it'd be nice to also get the logs from a recent kernel when it's totally hung, if it's not too difficult.
Huh. Well, I'm building the latest kernel.org git, and I haven't gotten it to hang yet. Displays still power off, but I can ssh and reboot normally.
I'll do another bisect with the serial console attached to see if that is more informative. Won't be until next week sometime though. Sorry about the mix-up.
(P.S. Emil: the patch you attached, applied to a git kernel, does not appear to fix the "DP:0006:0382: bios data not found" error.)
Created attachment 80796 [details]
log from recent git with patch, no hang, no video
Well, I haven't gotten a recent kernel to hang. So that's good in a way. Unfortunately the displays still don't work.
Attached a log from the serial console, on a recent git kernel with the "Parse DP table v20" patch applied. If I need to re-bisect, let me know what I should look for as a determining factor for good/bad.
You know, this warning in your dmesg doesn't fill me with confidence:
[ 0.188533] WARNING: at drivers/iommu/intel_irq_remapping.c:533 intel_irq_remapping_supported+0x35/0x78()
[ 0.198212] This system BIOS has enabled interrupt remapping
[ 0.198212] on a chipset that contains an erratum making that
[ 0.198212] feature unstable. To maintain system stability
[ 0.198212] interrupt remapping is being disabled. Please
[ 0.198212] contact your BIOS vendor for an update
Can you see if you can disable DMAR in your BIOS? I'm sure it's not related, but... who knows.
Secondly, your log appears to only have errors/warns, not info prints. Would be good to get the full log. Is that patch still needed BTW? What happens without it?
These I don't like:
[ 9.190684] nouveau E[ PDISP][0000:03:00.0] DP:0006:0382: bios data not found
[ 9.191377] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x03854080 FAULT at 0x614b00
[ 9.192380] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000090 FAULT at 0x61c9a8
[ 9.193055] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x6df3fe24 FAULT at 0x614200
[ 9.193778] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000000 FAULT at 0x614b00
[ 9.194453] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x80000000 FAULT at 0x610030
[ 9.385564] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00100000 FAULT at 0x61c804
[ 9.385915] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000008 FAULT at 0x61c804
[ 9.386332] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x80000000 FAULT at 0x61c804
[ 9.386676] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000008 FAULT at 0x61c804
[ 9.387021] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000008 FAULT at 0x61c830
[ 11.168116] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x80000001 FAULT at 0x61c804
[ 11.168794] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000008 FAULT at 0x61c830
[ 11.169464] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x00000020 FAULT at 0x642000
[ 11.171770] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x0000003c FAULT at 0x640000
[ 11.174084] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000028 FAULT at 0x640000
[ 11.175080] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x80000001 FAULT at 0x61d004
[ 11.177067] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x0000001c FAULT at 0x640000
[ 11.178719] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x00000008 FAULT at 0x640000
[ 11.179107] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x00000014 FAULT at 0x640000
[ 13.178552] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x0000006c FAULT at 0x642000
And they're probably the source of the problem. BTW, it's worth asking -- are there actual DP connectors on your card, and/or are you using adapters of some sort?
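To compare fault patterns between boots, the PBUS lines can be tallied per register address. This is a minimal sketch that assumes the log format is exactly as quoted above; the function names and the sample selection are illustrative only, not part of nouveau or any existing tool.

```python
import re
from collections import Counter

# Matches lines such as:
#   nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x03854080 FAULT at 0x614b00
FAULT_RE = re.compile(
    r"nouveau E\[\s*PBUS\]\[(?P<dev>[0-9a-fA-F:.]+)\] "
    r"MMIO (?P<op>read|write) of (?P<value>0x[0-9a-fA-F]+) "
    r"FAULT at (?P<addr>0x[0-9a-fA-F]+)"
)


def parse_faults(lines):
    """Return (op, value, addr) tuples for every PBUS MMIO fault line."""
    faults = []
    for line in lines:
        m = FAULT_RE.search(line)
        if m:
            faults.append((m.group("op"),
                           int(m.group("value"), 16),
                           int(m.group("addr"), 16)))
    return faults


def faults_by_address(faults):
    """Count how many faults hit each MMIO address."""
    return Counter(addr for _op, _value, addr in faults)


# Three lines copied from the log quoted above.
SAMPLE = [
    "[ 9.191377] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x03854080 FAULT at 0x614b00",
    "[ 9.193778] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000000 FAULT at 0x614b00",
    "[ 9.194453] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x80000000 FAULT at 0x610030",
]

if __name__ == "__main__":
    for addr, count in sorted(faults_by_address(parse_faults(SAMPLE)).items()):
        print(hex(addr), count)
```

Running the same tally over a good boot and a bad boot would show which addresses only fault in the bad one.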
Created attachment 84674 [details]
log, recent git, nouveau.debug=debug, kms fails, no hang
Here's a new log from today's kernel.org git. Somehow I neglected to set nouveau.debug=debug on that last one; sorry about that!
I checked my BIOS setup, but there wasn't anything regarding DMA or IRQ remapping. I have the latest version, but then it's a Dell BIOS which apparently values mouse support over advanced settings. *shrug*
I didn't see any change in behavior with the patch, and the "bios data not found" error still appeared even with it, so I haven't been using it.
I'm using two DisplayPort outputs on my card, with DisplayPort cables, to two DisplayPort sinks, no adapters anywhere.
Would you mind testing this again with 3.13.x? A few fixes went in, especially one that fixes errors setting up displays in NV96 VBIOSes (for HDMI, but your case could be similar).
Assuming it still doesn't work, could you attach a boot log (from 3.13.x) with nouveau.debug=PDISP=debug,VBIOS=trace ?
Also, could you attach a boot log from a good kernel (not necessarily the last good rev) with nouveau.debug=debug drm.debug=0xe (this should cause a VBIOS trace from that kernel as well... if it doesn't, I'll have to go back and re-check the kernel parameters for the older version).
(This is banking on something being different wrt the vbios, either execution path or perhaps when/how often it is invoked.)
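For reference, drm.debug is a bitmask, so 0xe enables several message categories at once. The sketch below decodes it using the bit assignments as I recall them from the kernel's DRM headers of that era (DRM_UT_CORE, DRM_UT_DRIVER, DRM_UT_KMS, DRM_UT_PRIME). Treat the table as an assumption and check it against the kernel source being booted, since newer kernels define additional bits.

```python
# Assumed bit assignments for the drm.debug module parameter.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
}


def decode_drm_debug(mask):
    """Return the category names enabled by a drm.debug bitmask,
    lowest bit first. Bits not in the table are reported as hex."""
    names = []
    bit = 1
    while bit <= mask:
        if mask & bit:
            names.append(DRM_DEBUG_BITS.get(bit, hex(bit)))
        bit <<= 1
    return names


if __name__ == "__main__":
    print(decode_drm_debug(0xe))
```

Under that table, 0xe selects everything except the core messages.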
Curiously I don't see any opcode 0x44 in your VBIOS. But I think there are some tables I'm actually missing, so perhaps it's there. That's definitely an unknown opcode though -- old or new kernels.
Er, sorry. The patch didn't actually go into 3.13.x -- can you try 3.14-rc1 or drm-next (which should be 3.13 + drm changes)?
I will look into this if research/teaching give me a break sometime soon. :) The computer I had been using has been retired, so I'll need time either to reanimate it or get the card and bring it home to test.