An x86_64 system running Ubuntu 6.06 fails to start the X server when a HyperTransport device in the system advertises a 64-bit PCI memory BAR of 4GB or larger. When the BAR size is 2GB, the X server runs.
In xf86str.h, pciVideoRec has a field "int size[6]" that should be "unsigned long size[6]".
Nope, looks like the size is tracked in terms of bits so "int" works there.
Onboard ATI Rage XL chip support works. However, Nvidia binary driver version 9631 causes board reset even with KDB enabled in kernel.
0000:02:00.0 VGA compatible controller: nVidia Corporation NV43 [GeForce 6600] (rev a2) (prog-if 00 [VGA])
        Subsystem: ASUSTeK Computer Inc.: Unknown device 81b0
        Flags: bus master, fast devsel, latency 0, IRQ 18
        Memory at f0000000 (32-bit, non-prefetchable) [size=64M]
        Memory at c000000000 (64-bit, prefetchable) [size=256M]
        Memory at f4000000 (64-bit, non-prefetchable) [size=16M]
        Expansion ROM at f5000000 [disabled] [size=128K]
        Capabilities: [60] Power Management version 2
        Capabilities: [68] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
        Capabilities: [78] #10 [0001]

console log shows after running startx:

(WW) ****INVALID MEM ALLOCATION**** b: 0xc000000000 e: 0xc00fffffff correcting
Requesting insufficient memory window!: start: 0x0 end: 0xfffffff size 0xc010000000
Requesting insufficient memory window!: start: 0xf0000000 end: 0xf50fffff size 0xc010000000
(EE) Cannot find a replacement memory range
Created attachment 8477
Xserver log file when BAR size is 512MB

Even though the Xserver and Nvidia driver appear to work, the log file shows a problem: the Nvidia framebuffer region 0xfce0000000 shows up at 0xe0000000 after the memory window is supposedly "fixed" by the Xserver.
This could be a problem in memory window handling when 40-bit addresses are involved.
lspci -xxx -s 2:0 before running startx:

02:00.0 VGA compatible controller: nVidia Corporation NV43 [GeForce 6600] (rev a2)
00: de 10 41 01 47 01 10 00 a2 00 00 03 10 00 00 00
10: 00 00 00 f0 0c 00 00 e0 fc 00 00 00 04 00 00 f4
20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 b0 81
30: 00 00 00 f5 60 00 00 00 00 00 00 00 00 01 00 00
40: 43 10 b0 81 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 00 00 00 01 00 00 00 ce d6 23 00 00 00 00 00
60: 01 68 02 00 00 00 00 00 05 78 80 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 10 00 01 00 c0 04 00 00
80: 10 28 00 00 01 4d 01 00 08 00 01 11 00 00 00 00
90: 00 00 00 00 00 00 00 00 0c 08 40 c1 01 04 40 c1
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

lspci output after running startx:

00: de 10 41 01 47 01 10 00 a2 00 00 03 00 00 00 00
10: 00 00 00 f0 0c 00 00 e0 00 00 00 00 04 00 00 f4
20: 00 00 00 00 00 00 00 00 00 00 00 00 43 10 b0 81
30: 00 00 00 00 60 00 00 00 00 00 00 00 00 01 00 00
40: 43 10 b0 81 00 00 00 00 00 00 00 00 00 00 00 00
50: 01 00 00 00 01 00 00 00 ce d6 23 00 00 00 00 00
60: 01 68 02 00 00 00 00 00 05 78 80 00 00 00 00 00
70: 00 00 00 00 00 00 00 00 10 00 01 00 c0 04 00 00
80: 10 28 00 00 01 4d 01 00 08 00 01 11 00 00 00 00
90: 00 00 00 00 00 00 00 00 0c 08 40 c1 01 04 40 c1
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
c0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
d0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

You can see that the framebuffer BAR has been overwritten by Xorg from 0xfce0000000 -> 0x00e0000000.

Here are the MTRR registers:

reg00: base=0x00000000 (   0MB), size=4096MB: write-back, count=1
reg01: base=0x100000000 (4096MB), size=1024MB: write-back, count=1
reg02: base=0xc0000000 (3072MB), size=1024MB: uncachable, count=1

So one reason why this doesn't immediately kill the machine, I presume, is that the truncated address just happens to fall into the MMIO hole in the 3-4GB range and no other device is using that truncated address range.

Here is the Opteron's address map:

00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00: 22 10 01 11 00 00 00 00 00 00 00 06 00 00 80 00
10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
30: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
40: 03 00 00 00 00 00 3f 01 00 00 00 00 01 00 00 00
50: 00 00 00 00 02 00 00 00 00 00 00 00 03 00 00 00
60: 00 00 00 00 04 00 00 00 00 00 00 00 05 00 00 00
70: 00 00 00 00 06 00 00 00 00 00 00 00 07 00 00 00
80: 00 00 00 00 00 00 00 00 03 0a 00 00 00 0c 00 00
90: 03 00 fc 00 20 1f fc 00 03 00 f0 fc 20 ff ef fc
a0: 03 20 fc 00 10 1f fc 00 03 00 c0 fc 10 ff df fc
b0: 03 00 f0 00 00 1f f7 00 03 00 e0 fc 00 ff ef fc
c0: 30 f0 8f 01 37 e0 fb 01 03 40 00 00 20 30 00 00
d0: 03 40 00 00 10 30 00 00 13 10 00 00 00 30 00 00
e0: 03 00 00 02 03 01 40 40 03 02 80 82 00 00 00 00
f0: 01 40 00 c0 00 00 00 00 00 00 00 00 00 00 00 00
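
The two config-space dumps of 02:00.0 above can be compared byte by byte to see exactly what changed across startx. The following is only a sketch for reading the dumps, not anything the Xserver does; the helper names parse and changed_offsets are made up for illustration, and only the first 0x40 bytes (the standard PCI header) are copied in.

```python
# First 0x40 bytes of config space for 02:00.0, copied from the dumps above.
BEFORE = """
de 10 41 01 47 01 10 00 a2 00 00 03 10 00 00 00
00 00 00 f0 0c 00 00 e0 fc 00 00 00 04 00 00 f4
00 00 00 00 00 00 00 00 00 00 00 00 43 10 b0 81
00 00 00 f5 60 00 00 00 00 00 00 00 00 01 00 00
"""
AFTER = """
de 10 41 01 47 01 10 00 a2 00 00 03 00 00 00 00
00 00 00 f0 0c 00 00 e0 00 00 00 00 04 00 00 f4
00 00 00 00 00 00 00 00 00 00 00 00 43 10 b0 81
00 00 00 00 60 00 00 00 00 00 00 00 00 01 00 00
"""


def parse(dump):
    """Turn whitespace-separated hex bytes into a bytes object."""
    return bytes(int(tok, 16) for tok in dump.split())


def changed_offsets(before, after):
    """Return (offset, old, new) for every byte that differs."""
    return [(off, b, a)
            for off, (b, a) in enumerate(zip(before, after))
            if b != a]


if __name__ == "__main__":
    for off, old, new in changed_offsets(parse(BEFORE), parse(AFTER)):
        print(f"offset 0x{off:02x}: {old:02x} -> {new:02x}")
```

In the standard PCI header, offset 0x0c is the cache line size, offset 0x18 is the low byte of the upper dword of the 64-bit BAR that starts at 0x14, and offset 0x33 is the top byte of the expansion ROM base address. The change at 0x18 is the one that turns 0xfce0000000 into 0x00e0000000.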
Sorry about the phenomenal bug spam, guys. Adding xorg-team@ to the QA contact so bugs don't get lost in future.
Can you attach the output of 'lspci -v' from a configuration with a >4GB BAR?
Comment #4 shows a case where BAR > 4GB. The Nvidia framebuffer has a 40-bit address 0xc000000000 so that when the Xserver truncates the Nvidia BAR it will be 0x0 and the machine hangs. The BAR > 2G just has the side-effect of increasing the alignment of the Nvidia framebuffer as the algorithm used by Linuxbios to allocate the prefmem64 resources is quite simple.
As stated above, the underlying problem is present even when no BAR is larger than 2G in size. dmesg suggests that nvidia.ko might also be buggy; the NVRM debug output reports an incorrect BAR as well:

[ 368.383230] NVRM: probing 0x10de 0x141, class 0x30000
[ 368.383242] PCI: Setting latency timer of device 0000:02:00.0 to 64
[ 368.383365] NVRM: 02:00.0 10de:0141 - 0xf0000000 [size=64M]
[ 368.383369] NVRM: 02:00.0 10de:0141 - 0xe0000000 [size=256M]
[ 368.383381] NVRM: loading NVIDIA UNIX x86_64 Kernel Module 100.14.11 Wed Jun 13 16:33:22 PDT 2007
[ 368.383410] saved orig pats as 0x70406 0x70406
[ 368.383798] changed pats to 0x70106 0x70406

It could just be that the debug message itself is incorrect and is printing only BAR1 and ignoring BAR2 even though it's a 64-bit BAR. lspci reports:

02:00.0 VGA compatible controller: nVidia Corporation NV43 [GeForce 6600] (rev a2) (prog-if 00 [VGA])
        Subsystem: ASUSTeK Computer Inc. Unknown device 81b0
        Flags: bus master, fast devsel, latency 0, IRQ 18
        Memory at f0000000 (32-bit, non-prefetchable) [size=64M]
        Memory at fce0000000 (64-bit, prefetchable) [size=256M]
        Memory at f4000000 (64-bit, non-prefetchable) [size=16M]
        Expansion ROM at f5000000 [disabled] [size=128K]
        Capabilities: [60] Power Management version 2
        Capabilities: [68] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
        Capabilities: [78] Express Endpoint IRQ 0

Another interesting note is that after starting Xorg, the Xserver rewrites the 64-bit BAR to be 0xe0000000, and then when Xorg is shut down I've seen the BAR rewritten back to 0xfce0000000. I'm assuming nvidia.ko is using power management code and is calling pci_restore_state at some point.
In other words, the kernel's PCI layer still thinks the BAR is 0xfce0000000, and the Opteron's MMIO Address Map register shows (using lspci -xxx -s 0:18.1):

b0: 03 00 f0 00 00 1f f7 00 03 00 e0 fc 00 ff ef fc
---------------------------^^^^^^

so that requests to that HT address go to the correct PCIe bridge (nodeid=0, link id=0), which is just the CK804 Nvidia Southbridge. Also, the bridge's prefetchable memory base/limit registers were programmed by the BIOS consistent with that original BAR value:

00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3) (prog-if 00 [Normal decode])
        Flags: bus master, fast devsel, latency 0
        Bus: primary=00, secondary=02, subordinate=02, sec-latency=0
        Memory behind bridge: f0000000-f50fffff
        Prefetchable memory behind bridge: 000000fce0000000-000000fcefffffff
        Capabilities: [40] Power Management version 2
        Capabilities: [48] Message Signalled Interrupts: Mask- 64bit+ Queue=0/1 Enable-
        Capabilities: [58] HyperTransport: MSI Mapping
        Capabilities: [80] Express Root Port (Slot+) IRQ 0
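
For anyone who wants to check the b0 row by hand, here is a rough decoder. The field layout is my reading of the K8 BKDG for the function 1 MMIO base/limit register pairs (bits 31:8 hold address bits 39:16, RE/WE in the low bits of the base register, DstNode and DstLink in the low bits of the limit register), so treat it as an assumption rather than a reference. The name decode_mmio_pair is made up for this sketch.

```python
def decode_mmio_pair(raw8):
    """Decode one 8-byte K8 MMIO base/limit pair (assumed BKDG layout)."""
    base_reg = int.from_bytes(raw8[0:4], "little")
    limit_reg = int.from_bytes(raw8[4:8], "little")
    return {
        # Register bits 31:8 carry physical address bits 39:16.
        "base": (base_reg >> 8) << 16,
        # The limit is inclusive, so the low 16 bits are all ones.
        "limit": ((limit_reg >> 8) << 16) | 0xFFFF,
        "read_enable": bool(base_reg & 0x1),
        "write_enable": bool(base_reg & 0x2),
        "dst_node": limit_reg & 0x7,
        "dst_link": (limit_reg >> 4) & 0x3,
    }


# Row b0 from the 00:18.1 dump: two pairs, at 0xb0 and 0xb8.
ROW_B0 = bytes(int(t, 16) for t in
               "03 00 f0 00 00 1f f7 00 03 00 e0 fc 00 ff ef fc".split())

if __name__ == "__main__":
    for offset in (0, 8):
        pair = decode_mmio_pair(ROW_B0[offset:offset + 8])
        print(f"0x{0xb0 + offset:02x}: "
              f"0x{pair['base']:010x}-0x{pair['limit']:010x} "
              f"node={pair['dst_node']} link={pair['dst_link']}")
```

Under that layout the pair at 0xb8 decodes to 0xfce0000000-0xfcefffffff with node 0 and link 0, which matches the bridge's prefetchable window listed above.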
debug output from nvidia kernel driver after running startx:

[ 411.314410] NVRM: nv_kern_open...
[ 411.314416] NVRM: nv_kern_ctl_open
[ 411.314419] NVRM: nv_acpi_init: acpi_bus_register_driver() failed (-19)!
[ 411.314422] NVRM: failed to register with the ACPI subsystem!
[ 411.314789] NVRM: ioctl(0xd2, 0xd0b0b130, 0x48)
[ 411.314804] NVRM: ioctl(0xc8, 0xdd9b2c00, 0xe0)
[ 411.314928] NVRM: ioctl(0x22, 0xd0b0b290, 0xc)
[ 411.322541] NVRM: ioctl(0x2a, 0xd0b0b1b0, 0x20)
[ 411.322640] NVRM: ioctl(0x4d, 0xd0b0b150, 0x40)
[ 411.322659] NVRM: ioctl(0x4d, 0xd0b0b150, 0x40)
[ 411.322686] NVRM: ioctl(0x2a, 0xd0b0b0a0, 0x20)
[ 411.323285] NVRM: nv_kern_open...
[ 411.323289] NVRM: nv_kern_open on device 0
[ 411.323294] NVRM: Incorrect BAR1 = 0x4000000c, restoring 0xe0000000
[ 411.323298] NVRM: Incorrect BAR1 = 0x000000004000000c, restoring 0x00000000e0000000
[ 411.323313] NVRM: RmInitAdapter: 2:0
[ 411.323316] NVRM: RmSetupRegisters for 0x10de:0x141
[ 411.323317] NVRM: pci config info:
[ 411.323319] NVRM: registers look like: 0xf0000000 0x4000000
[ 411.323322] NVRM: fb looks like: 0xe0000000 0x10000000
[ 411.323324] NVRM: warning, kernel thinks our registers are 67108864, scaling back to 16777216
[ 411.323355] NVRM: Successfully mapped framebuffer and registers
[ 411.323357] NVRM: final mappings:
[ 411.323359] NVRM: regs: 0xf0000000 0x1000000 0x580000
changed summary to reflect broader problem found
Changed version to 7.2.
Changed severity to major since it hangs the machine when the truncated address is 0x0.
I just booted with no RPU and observed that after starting the Xserver the machine is unusable. It behaves differently from the case where the RPU BAR > 2GB, but it is still unable to run the Xserver. In this case the BIOS allocates 0xfcf0000000 as the framebuffer address and 0xf0000000 for BAR0 on the Nvidia card. Hence the truncated FB address is the same as BAR0 and it breaks.
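
The three framebuffer addresses seen so far in this bug can be run through a 32-bit truncation to show why each case behaves differently. This is only arithmetic on the reported values, not Xserver code; truncate32 and overlaps are names made up for the sketch, and the 64M/256M sizes are taken from the lspci listings above on the assumption that they are unchanged in the no-RPU case.

```python
MASK32 = 0xFFFFFFFF
MB = 1 << 20


def truncate32(addr):
    """Keep only the low 32 bits, as a 32-bit BAR read/write would."""
    return addr & MASK32


def overlaps(a_base, a_size, b_base, b_size):
    """True if [a_base, a_base+a_size) intersects [b_base, b_base+b_size)."""
    return a_base < b_base + b_size and b_base < a_base + a_size


BAR0_BASE, BAR0_SIZE = 0xf0000000, 64 * MB   # register BAR from lspci
FB_SIZE = 256 * MB                           # framebuffer BAR size from lspci

if __name__ == "__main__":
    for fb in (0xc000000000, 0xfce0000000, 0xfcf0000000):
        low = truncate32(fb)
        clash = overlaps(low, FB_SIZE, BAR0_BASE, BAR0_SIZE)
        print(f"0x{fb:010x} -> 0x{low:08x}  overlaps BAR0: {clash}")
```

The first address truncates to 0x0 (the hang), the second lands in the unused part of the 3-4GB hole, and the third lands exactly on BAR0.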
pciBusAddrToHostAddr is broken on x86 64-bit Linux platforms because xf86GetOSOffsetFromPCI does not check whether the BAR has the 64-bit memory attribute bit set. The routine must not compare against base addresses using only 32 bits of a 40-bit physical address.
Changing xf86GetOSOffsetFromPCI to use pci_device_cfg_read_u32(dev, &savePtr, offset) after merging the pcirework branch still doesn't take into account whether the BAR is a 64-bit BAR or not.
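
To make the point about the 64-bit memory attribute bit concrete, here is a sketch of walking the six BAR slots and combining two dwords when the type field says 64-bit. It is not the Xorg or libpciaccess code; decode_bars is a hypothetical helper, and the input is the first 0x40 bytes of the 02:00.0 dump taken before startx. The bit positions (bit 0 I/O vs memory, bits 2:1 type with 10b meaning 64-bit) are the standard PCI BAR encoding.

```python
def decode_bars(cfg):
    """Return (slot, address, is_64bit) for each populated memory BAR."""
    bars = []
    slot = 0
    while slot < 6:
        off = 0x10 + 4 * slot
        lo = int.from_bytes(cfg[off:off + 4], "little")
        if lo & 0x1:                      # I/O BAR, not handled in this sketch
            slot += 1
            continue
        # Type field 10b marks a 64-bit BAR; it needs a following slot.
        is64 = (lo & 0x6) == 0x4 and slot < 5
        addr = lo & 0xFFFFFFF0            # strip the attribute bits
        if is64:
            hi = int.from_bytes(cfg[off + 4:off + 8], "little")
            addr |= hi << 32              # upper dword lives in the next slot
        if lo != 0:                       # all-zero slot: nothing mapped
            bars.append((slot, addr, is64))
        slot += 2 if is64 else 1          # skip the slot used as the upper half
    return bars


CFG = bytes(int(t, 16) for t in """
de 10 41 01 47 01 10 00 a2 00 00 03 10 00 00 00
00 00 00 f0 0c 00 00 e0 fc 00 00 00 04 00 00 f4
00 00 00 00 00 00 00 00 00 00 00 00 43 10 b0 81
00 00 00 f5 60 00 00 00 00 00 00 00 00 01 00 00
""".split())

if __name__ == "__main__":
    for slot, addr, is64 in decode_bars(CFG):
        print(f"BAR{slot}: 0x{addr:x} ({'64' if is64 else '32'}-bit)")
```

A reader that looks at only one dword per BAR would report 0xe0000000 for slot 1 and would then misread slot 2 (0x000000fc) as a BAR of its own.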
After hacking the linuxbios code to allocate prefetchable PCI resources in the non-prefetchable region for PCI devices in the display class, the Xserver happily works again with the Nvidia card. This hack effectively just means the framebuffer BAR is allocated below 4GB. IMO, this is further evidence 64-bit BAR support is broken.

Below is part of the Xorg log file showing the Nvidia framebuffer allocated at offset 0xd0000000 with size 256MB.

[18] -1 0 0xfcf0200000 - 0xfcf021ffff (0x20000) MX[B]
[19] -1 0 0xfce0000000 - 0xfcffffffff (0x20000000) MX[B]
[20] -1 0 0xe5000000 - 0xe501ffff (0x20000) MX[B](B)
[21] -1 0 0xe4000000 - 0xe4ffffff (0x1000000) MX[B](B)
[22] -1 0 0xd0000000 - 0xdfffffff (0x10000000) MX[B](B)
[23] -1 0 0xe0000000 - 0xe3ffffff (0x4000000) MX[B](B)

02:00.0 VGA compatible controller: nVidia Corporation NV43 [GeForce 6600] (rev a2) (prog-if 00 [VGA])
        Subsystem: ASUSTeK Computer Inc. Unknown device 81b0
        Flags: bus master, fast devsel, latency 0, IRQ 18
        Memory at e0000000 (32-bit, non-prefetchable) [size=64M]
        Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Memory at e4000000 (64-bit, non-prefetchable) [size=16M]
        Expansion ROM at e5000000 [disabled] [size=128K]
        Capabilities: [60] Power Management version 2
        Capabilities: [68] Message Signalled Interrupts: Mask- 64bit+ Queue=0/0 Enable-
        Capabilities: [78] Express Endpoint IRQ 0
Which bits is this with? Have you tried core xserver since the pci-rework merge? AFAIK, that should make this a non-issue.
Michael, Please respond with a status update. I'm assuming that this problem no longer exists since PCI-rework landed in the trunk. If I don't hear back soon, I'm going to continue this assumption and close the bug. Thanks.
I waited a year. Closing as stale.