Created attachment 77445 [details]
systemd journal with kernel messages

Hardware: NVIDIA Corporation G96 [Quadro FX 580] (rev a1)
Distro: Arch Linux
Bisected commit: http://cgit.freedesktop.org/nouveau/linux-2.6/commit/?id=2c1a425e7d537ea655fa276f267c0b1cd100ff34

Steps to reproduce:
1. Boot system into text mode with normal kernel parameters.
2. Wait for module to load and attempt KMS.

Expected behavior:
3. KMS succeeds, login prompt appears.

Actual behavior:
3. System locks up and monitors go into power-saving mode.

More info:
* With nouveau.modeset=0, system boots normally.
* Was able to use SysRq to sync disks after hang but could not ssh, ping, or reboot.
* Regression still present in current nvidia-git (db1b0f52) as well as kernel.org git.
* nvidia module is not installed.

lspci.log and vbios.rom forthcoming.
Created attachment 77446 [details] lspci -vv
Created attachment 77447 [details] vbios.rom from debugfs
(Sorry - not "nvidia-git" in the description, I meant to type "Regression still present in current nouveau-git.")
Is there any further information or testing that I can provide to help resolve this issue?
"nouveau E[ PBUS][0000:01:00.0] MMIO read of 0xFOO FAULT at 0xBAR"

01:00.0 VGA compatible controller [0300]: NVIDIA Corporation ION VGA [10de:087d] (rev b1)
3.9.2-200.fc18.x86_64

..
fb: conflicting fb hw usage nouveaufb vs VESA VGA - removing generic driver
nouveau 0000:01:00.0: setting latency timer to 64
nouveau [ DEVICE][0000:01:00.0] BOOT0 : 0x0ace80b1
nouveau [ DEVICE][0000:01:00.0] Chipset: MCP79/MCP7A (NVAC)
nouveau [ DEVICE][0000:01:00.0] Family : NV50
nouveau [ VBIOS][0000:01:00.0] checking PRAMIN for image...
nouveau [ VBIOS][0000:01:00.0] ... appears to be valid
nouveau [ VBIOS][0000:01:00.0] using image from PRAMIN
nouveau [ VBIOS][0000:01:00.0] BIT signature found
nouveau [ VBIOS][0000:01:00.0] version 62.79.78.00.00
nouveau [ PFB][0000:01:00.0] RAM type: stolen system memory
nouveau [ PFB][0000:01:00.0] RAM size: 256 MiB
nouveau [ PFB][0000:01:00.0] ZCOMP: 0 tags
nouveau [ PTHERM][0000:01:00.0] FAN control: none / external
nouveau [ PTHERM][0000:01:00.0] fan management: disabled
nouveau [ PTHERM][0000:01:00.0] internal sensor: yes
nouveau [ PTHERM][0000:01:00.0] Programmed thresholds [ 90(3), 95(3), 95(2), 135(5) ]
nouveau [ DRM] VRAM: 256 MiB
nouveau [ DRM] GART: 512 MiB
nouveau [ DRM] TMDS table version 2.0
nouveau [ DRM] DCB version 4.0
nouveau [ DRM] DCB outp 00: 02000300 0000001e
nouveau [ DRM] DCB outp 01: 01011322 00000030
nouveau [ DRM] DCB outp 02: 02022332 00020010
nouveau [ DRM] DCB conn 00: 00000000
nouveau [ DRM] DCB conn 01: 00001131
nouveau [ DRM] DCB conn 02: 00002261
***************************************************************************
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00030003 FAULT at 0x100220
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x100228
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x10022c
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x100234
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x100240
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x1002c0
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x1002c4
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x1002e4
***************************************************************************
nouveau [ DRM] 4 available performance level(s)
nouveau [ DRM] 0: core 200MHz shader 400MHz voltage 1010mV fanspeed 60%
nouveau [ DRM] 1: core 300MHz shader 600MHz voltage 1010mV fanspeed 100%
nouveau [ DRM] 2: core 350MHz shader 800MHz voltage 1010mV fanspeed 100%
nouveau [ DRM] 3: core 450MHz shader 1100MHz voltage 1010mV fanspeed 100%
nouveau [ DRM] c:
nouveau [ DRM] MM: using M2MF for buffer copies
nouveau [ DRM] allocated 800x600 fb: 0x50000, bo ffff8800c81b0800
fbcon: nouveaufb (fb0) is primary device
nouveau 0000:01:00.0: fb0: nouveaufb frame buffer device
[drm] Initialized nouveau 1.1.0 20120801 for 0000:01:00.0 on minor 1
..
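For anyone comparing logs from several boots, a small script can pull the PBUS fault lines out of a dmesg or journal dump and summarize them. The following is only a sketch: the regular expression is inferred from the lines quoted in this report, not taken from the nouveau source, and the names `FAULT_RE`, `parse_faults`, and `faults_by_address` are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the log lines quoted in this bug report, e.g.
#   nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x1002c0
# An optional "[ 9.191377]" dmesg timestamp in front is tolerated.
FAULT_RE = re.compile(
    r"(?:\[\s*(?P<ts>\d+\.\d+)\]\s+)?"
    r"nouveau\s+E\[\s*PBUS\]\[(?P<dev>[0-9a-fA-F:.]+)\]\s+"
    r"MMIO\s+(?P<op>read|write)\s+of\s+0x(?P<value>[0-9a-fA-F]+)\s+"
    r"FAULT\s+at\s+0x(?P<addr>[0-9a-fA-F]+)"
)


def parse_faults(log_text):
    """Return one dict per PBUS MMIO fault found in log_text."""
    faults = []
    for m in FAULT_RE.finditer(log_text):
        faults.append({
            "time": float(m.group("ts")) if m.group("ts") else None,
            "device": m.group("dev"),
            "op": m.group("op"),
            "value": int(m.group("value"), 16),
            "addr": int(m.group("addr"), 16),
        })
    return faults


def faults_by_address(faults):
    """Count how many faults hit each register address."""
    return Counter(f["addr"] for f in faults)


# Three lines copied from the logs quoted in this thread.
SAMPLE = """\
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00030003 FAULT at 0x100220
nouveau E[ PBUS][0000:01:00.0] MMIO read of 0x00000008 FAULT at 0x100228
[ 9.194453] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x80000000 FAULT at 0x610030
"""

for fault in parse_faults(SAMPLE):
    print(f'{fault["op"]:5s} addr=0x{fault["addr"]:06x} value=0x{fault["value"]:08x}')
```

Because the pattern uses `finditer` over the whole text, it also copes with logs whose line breaks were lost when pasted into a comment.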
Created attachment 79921 [details] full syslog with nouveau.debug=debug Attaching syslog with extra debug info, in case this is more useful than the previous attachment. (Note to poma: unless you are a Nouveau dev, gratuitously raising the sev and prio fields to the top is likely to get a bug ignored. Setting back.)
Devin, can you confirm that commit 2c1a425e7 really is the first bad one? It adds support for a subdevice/engine that is not present on your system, so the code is unused in your case. Please build a kernel based on the parent commit, 757833cc9, and attach a debug log (nouveau.debug=debug).
Well darn, apparently it isn't. Unfortunately the bug doesn't occur 100% of the time when you boot a bad build, so I'm going to have to re-bisect and do a lot of rebooting. I'll attach a debug log from the last known good build... once I know what that is. :)
Created attachment 80240 [details] debug log from last good commit Whew! Re-bisected, and the first commit I've been able to reproduce the bug on is ebb945a, "port all engines to new engine module format". Attaching a debug log from the last good commit, ac1499d, as requested.
Nooo, why did it have to lead to that commit :(

Devin, I don't suppose you have a debug dmesg closer to the offending commit, do you? The delta between these two is quite substantial.

My initial suspicion is that the stored DP data (or its parsing) is to blame for your issue.

Working
* Appears to be parsed properly?
* unknown opcode 0x44

Failing
* nouveau E[ PDISP][0000:03:00.0] DP:0006:0382: bios data not found
s/debug dmesg/failing debug log/
Created attachment 80348 [details] [review] Parse DP table v20 Additionally, you can try the attached patch. It will enable your DP table (v20) to be parsed the same way as v21 (seen on my nv96 - GT120M). It should clear the "DP:0006:0382: bios data not found" message.
I'll see what I can do about getting a log from the first buggy commit. Occasionally when I'm lucky it lets me REISUB. (The obsoleted attachment was from a much closer commit, but was without nouveau.debug=debug, if that helps at all.)
Created attachment 80439 [details] log from last good rev
Created attachment 80440 [details] log from first bad rev, non-bugged boot
Created attachment 80441 [details] log from first bad rev, hanged boot Have some logs. God bless the serial console. Attached one full log from the last good commit, and two logs from the first bad commit: one where the system boots normally, and one where it hangs.
[ 11.902160] RIP: 0010:[<ffffffffa0570160>] [<ffffffffa0570160>] drm_fb_helper_hotplug_event+0x10/0x100 [drm_kms_helper] That seems to imply that drm->fbcon->helper == NULL. But that's not what you're seeing on recent kernels, so the implication is that there's some sort of second concurrency bug in the rev you identified that got fixed later on.... I don't think that your bisect results are valid as a result :( Was the original problem 100% reproducible? If so, you should mark any kernel that works sometimes as "good" (even if it's not really *that* good). Also, the original problem is just monitors not working, but you can ssh in/etc, right? Or does the system totally hang? (Doesn't seem that way given that logs showed up in systemd...)
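As an aside for readers unfamiliar with oops output, the `symbol+0xOFFSET/0xSIZE [module]` part of the RIP line can be split mechanically. The sketch below assumes the older `RIP: 0010:[<addr>] [<addr>] symbol+off/size [module]` layout shown in the quoted line; `RIP_RE` and `parse_rip` are illustrative names, not part of any kernel tooling.

```python
import re

# Layout assumed from the RIP line quoted above (3.x-era oops format).
RIP_RE = re.compile(
    r"RIP: [0-9a-f]{4}:\[<(?P<addr>[0-9a-f]+)>\]"
    r".*?(?P<symbol>[A-Za-z_]\w*)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
    r"(?:\s+\[(?P<module>\w+)\])?"
)


def parse_rip(line):
    """Split an oops RIP line into address, symbol, offset, size, module."""
    m = RIP_RE.search(line)
    if m is None:
        return None
    addr = int(m.group("addr"), 16)
    offset = int(m.group("offset"), 16)
    return {
        "addr": addr,
        "symbol": m.group("symbol"),
        "offset": offset,                    # bytes into the function
        "size": int(m.group("size"), 16),    # total function size in bytes
        "module": m.group("module"),         # None for built-in code
        "function_start": addr - offset,
    }


line = ("[ 11.902160] RIP: 0010:[<ffffffffa0570160>] [<ffffffffa0570160>] "
        "drm_fb_helper_hotplug_event+0x10/0x100 [drm_kms_helper]")
info = parse_rip(line)
print(info["symbol"], hex(info["offset"]), hex(info["size"]), info["module"])
```

Here the faulting instruction sits 0x10 bytes into a 0x100-byte function, that is, very close to its entry point.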
No, the original problem was not 100% reproducible. It always felt (this is only a guess) like it had something to do with how long the KMS switch took... if it took too long, the monitors went into powersave, and the system would almost certainly hang. There seem to be three possibilities for a boot on a bugged revision: 1. Everything works. 2. Displays go into powersave, system hangs, and you can't ssh or use SysRq. 3. (Rare.) Displays go into powersave but you can still use SysRq. For the latest attachments, I used a serial console to reliably get logs from cases 1 and 2. The original logs (from a 3.9-series kernel) relied on getting case 3 so I could REISUB and read the logs on next boot, but this may not show the bug properly. Would you like me to get logs from a recent kernel using the serial console so you can compare?
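Since the hang is intermittent, a bisect step can only call a build "good" after several clean boots. A rough way to pick that number, assuming each boot of a bad build hangs independently with some probability `p_hang` (an assumed figure; nothing in this thread measures it), is to find the smallest `n` with `(1 - p_hang)**n` at or below the acceptable chance of mislabeling a bad build as good. The function name `boots_needed` is hypothetical.

```python
import math


def boots_needed(p_hang, max_false_good):
    """Smallest n such that (1 - p_hang)**n <= max_false_good.

    p_hang         -- assumed probability that one boot of a bad build hangs
    max_false_good -- acceptable chance of calling a bad build "good"
    """
    if not 0.0 < p_hang < 1.0:
        raise ValueError("p_hang must be strictly between 0 and 1")
    if not 0.0 < max_false_good < 1.0:
        raise ValueError("max_false_good must be strictly between 0 and 1")
    return math.ceil(math.log(max_false_good) / math.log(1.0 - p_hang))


# Example: if roughly half of all boots on a bad build hang, how many clean
# boots are needed before a build is declared good with under a 5% error rate?
print(boots_needed(0.5, 0.05))
```

The independence assumption is a simplification; if the hang depends on timing during the KMS switch, as guessed above, boots may be correlated and more repetitions would be prudent.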
Yes, it'd be nice to also get the logs from a recent kernel when it's totally hung, if it's not too difficult.
Huh. Well, I'm building the latest kernel.org git, and I haven't gotten it to hang yet. Displays still power off, but I can ssh and reboot normally. I'll do another bisect with the serial console attached to see if that is more informative. Won't be until next week sometime though. Sorry about the mix-up. (P.S. Emil: the patch you attached, applied to a git kernel, does not appear to fix the "DP:0005:0382: bios data not found" error.)
Created attachment 80796 [details] log from recent git with patch, no hang, no video Well, I haven't gotten a recent kernel to hang. So that's good in a way. Unfortunately the displays still don't work. Attached a log from the serial console, on a recent git kernel with the "Parse DP table v20" patch applied. If I need to re-bisect, let me know what I should look for as a determining factor for good/bad.
You know, this warning in your dmesg doesn't fill me with confidence:

[ 0.188533] WARNING: at drivers/iommu/intel_irq_remapping.c:533 intel_irq_remapping_supported+0x35/0x78()
[ 0.198212] This system BIOS has enabled interrupt remapping
[ 0.198212] on a chipset that contains an erratum making that
[ 0.198212] feature unstable. To maintain system stability
[ 0.198212] interrupt remapping is being disabled. Please
[ 0.198212] contact your BIOS vendor for an update

Can you see if you can disable DMAR in your bios? I'm sure it's not related, but... who knows.

Secondly, your log appears to only have errors/warns, not info prints. Would be good to get the full log.

Is that patch still needed BTW? What happens without it?

These I don't like:

[ 9.190684] nouveau E[ PDISP][0000:03:00.0] DP:0006:0382: bios data not found
[ 9.191377] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x03854080 FAULT at 0x614b00
[ 9.192380] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000090 FAULT at 0x61c9a8
[ 9.193055] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x6df3fe24 FAULT at 0x614200
[ 9.193778] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000000 FAULT at 0x614b00
[ 9.194453] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x80000000 FAULT at 0x610030
[ 9.385564] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00100000 FAULT at 0x61c804
[ 9.385915] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000008 FAULT at 0x61c804
[ 9.386332] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x80000000 FAULT at 0x61c804
[ 9.386676] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000008 FAULT at 0x61c804
[ 9.387021] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000008 FAULT at 0x61c830
[ 11.168116] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x80000001 FAULT at 0x61c804
[ 11.168794] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000008 FAULT at 0x61c830
[ 11.169464] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x00000020 FAULT at 0x642000
[ 11.171770] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x0000003c FAULT at 0x640000
[ 11.174084] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x00000028 FAULT at 0x640000
[ 11.175080] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x80000001 FAULT at 0x61d004
[ 11.177067] nouveau E[ PBUS][0000:03:00.0] MMIO read of 0x0000001c FAULT at 0x640000
[ 11.178719] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x00000008 FAULT at 0x640000
[ 11.179107] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x00000014 FAULT at 0x640000
[ 13.178552] nouveau E[ PBUS][0000:03:00.0] MMIO write of 0x0000006c FAULT at 0x642000

And they're probably the source of the problem.

BTW, it's worth asking -- are there actual DP connectors on your card, and/or are you using adapters of some sort?
Created attachment 84674 [details] log, recent git, nouveau.debug=debug, kms fails, no hang Here's a new log from today's kernel.org git. Somehow I neglected to set nouveau.debug=debug on that last one; sorry about that! I checked my BIOS setup, but there wasn't anything regarding DMA or IRQ remapping. I have the latest version, but then it's a Dell BIOS which apparently values mouse support over advanced settings. *shrug* I didn't see any change in behavior with the patch, and the "bios data not found" error still appeared even with it, so I haven't been using it. I'm using two DisplayPort outputs on my card, with DisplayPort cables, to two DisplayPort sinks, no adapters anywhere.
Would you mind testing this again with 3.13.x? A few fixes went in, especially one that fixes errors setting up displays in NV96 VBIOSes (for HDMI, but your case could be similar). Assuming it still doesn't work, could you attach a boot log (from 3.13.x) with nouveau.debug=PDISP=debug,VBIOS=trace? Also, could you attach a boot log from a good kernel (not necessarily the last good rev) with nouveau.debug=debug drm.debug=0xe? (This should cause a VBIOS trace from that kernel as well... if it doesn't, I'll have to go back and re-check the kernel parameters for the older version.) This is banking on something being different wrt the vbios, either the execution path or perhaps when/how often it is invoked. Curiously, I don't see any opcode 0x44 in your VBIOS. But I think there are some tables I'm actually missing, so perhaps it's there. That's definitely an unknown opcode though -- old or new kernels.
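For reference, `drm.debug` is a bitmask, so `0xe` selects several message categories at once. The sketch below decodes it using the four low bits as documented for the drm module's `debug` parameter in kernels of this era (CORE, DRIVER, KMS, PRIME); later kernels define further bits that are not listed here. The names `DRM_DEBUG_BITS` and `decode_drm_debug` are illustrative only.

```python
# Low bits of the drm.debug module parameter (3.x-era kernels).
# Newer kernels add more categories above these; they are omitted here.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by mask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(0xe))
```

So `drm.debug=0xe` enables everything in this set except the CORE messages.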
Er, sorry. The patch didn't actually go into 3.13.x. Can you try 3.14-rc1 or drm-next (which should be 3.13 + drm changes)?
I will look into this if research/teaching give me a break sometime soon. :) The computer I had been using has been retired, so I'll need time either to reanimate it or get the card and bring it home to test.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/43.