Bug 111371 - [NV04] bios OOB on kernel driver initialization
Summary: [NV04] bios OOB on kernel driver initialization
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-08-11 15:33 UTC by Jorge Natz
Modified: 2019-08-12 19:18 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
The non-verbose kernel log (12.60 KB, text/plain)
2019-08-11 15:33 UTC, Jorge Natz
no flags Details
The more verbose kernel log with the external module (22.00 KB, text/plain)
2019-08-11 15:41 UTC, Jorge Natz
no flags Details
dmesg log for nouveau.debug=bios=debug parameter, with nouveau as "tainted" external module (22.88 KB, text/plain)
2019-08-11 16:19 UTC, Jorge Natz
no flags Details
PRAMIN VBIOS dump (64.00 KB, application/octet-stream)
2019-08-11 20:12 UTC, Jorge Natz
no flags Details
PROM VBIOS dump (1.00 MB, application/octet-stream)
2019-08-11 20:12 UTC, Jorge Natz
no flags Details
Fetch workaround kernel log (25.80 KB, text/plain)
2019-08-12 03:26 UTC, Jorge Natz
no flags Details
Fetch workaround kernel log (new kernel) (42.82 KB, text/plain)
2019-08-12 15:24 UTC, Jorge Natz
no flags Details
Fetch workaround kernel log (new kernel) (43.41 KB, text/plain)
2019-08-12 15:51 UTC, Jorge Natz
no flags Details

Description Jorge Natz 2019-08-11 15:33:56 UTC
Created attachment 145023 [details]
The non-verbose kernel log

Steps to reproduce:
Boot linux with NV04 card in motherboard AGP slot.
OR
'modprobe nouveau' with NV04 card in motherboard AGP slot.

Actual Results:
Twice during the boot process the screen turns black and then returns to the text console. The nouveau module ends up loaded, but the display does not appear to be using the DRM driver.

Platform:
Alpine Linux 3.10.1 Standard, kernel 4.19.58 (non-verbose log)
Tiny Core Linux, kernel 4.19.10 (verbose external module log)

I tried using the kernel parameter nouveau.config=NvMSI=0, but it made no difference. I also tried using nouveau.config=NvBios=PRAMIN, even though there were no "checksum invalid" errors in the dmesg output.

I was unable to get a VBIOS dump, as the /sys/kernel/debug/dri folder was empty.

I was also hesitant to do a git bisection, as compiling the nouveau kernel module takes upwards of 7 hours on the machine I am using. However, I can try if you would like.

I have two kernel logs. One is from kernel 4.19.58, with nouveau compiled as an in-tree module. The other is from kernel 4.19.10, with nouveau compiled as an out-of-tree module; for that one I was able to add the kernel parameters drm.debug=14 and log_buf_len=16M.

I apologize if I left any information out, and thank you for any reply.
Comment 1 Jorge Natz 2019-08-11 15:41:43 UTC
Created attachment 145024 [details]
The more verbose kernel log with the external module
Comment 2 Ilia Mirkin 2019-08-11 15:53:19 UTC
Sounds like there's something weird going on:

[    1.862577] pci 0000:01:00.0: BAR 6: no space for [mem size 0x00010000 pref]
[    1.862592] pci 0000:01:00.0: BAR 6: failed to assign [mem size 0x00010000 pref]

So ... question 1: did this ever work, with any kernel?

I'll be honest - I've never actually tried a NV4 (no AGP motherboard here). Last I tried a NV5 (PCI version), it worked fine. This was some time ago, and I can re-check if necessary. However:

[  103.927615] nouveau 0000:01:00.0: pci: failed to acquire agp

is definitely worrying. From the kernel messages, it would appear that the agpgart support module is loading *after* nouveau. This is not good. Try compiling nouveau as a module, and ensuring that it loads after the agpgart module.

Of course, I don't know that this is highly related to the issue at hand -- AGP is for the GPU accessing system memory, which is not necessary to retrieve the VBIOS data.

Please try booting the 4.x kernel with

nouveau.debug=bios=debug

which might yield more info. Please don't use NvBios=PRAMIN when doing this, and include the full boot log.

You can also use nvagetbios from envytools to fetch using certain methods (definitely the ones applicable to your hw).
Comment 3 Jorge Natz 2019-08-11 16:19:01 UTC
These are the only two kernels I have tried with this card, and it has never worked on my machine. I was able to get a graphical display with the xorg vesa driver, though.

I also tried reloading the kernel module after agpgart loading using "rmmod nouveau" and then "modprobe nouveau" after boot, but it made no difference.

I will attach a dmesg log with nouveau.debug=bios=debug for the kernel with nouveau compiled as an out-of-tree module. It seems (to my limited knowledge) to contain interesting information.
Comment 4 Jorge Natz 2019-08-11 16:19:59 UTC
Created attachment 145025 [details]
dmesg log for nouveau.debug=bios=debug parameter, with nouveau as "tainted" external module
Comment 5 Ilia Mirkin 2019-08-11 16:35:37 UTC
Could you try retrieving the image using nvagetbios and uploading it? I'm curious what's in it, perhaps our "is this a valid image" logic needs fixing. Or perhaps it's bogus data -- not 100% surprising for a NV4. We definitely do have some examples of NV4 vbios's though, and in the past people have successfully booted NV4 boards with nouveau (and filed bugs about things much later having issues, like the ddx or GL accel - you can search for them).
Comment 6 Jorge Natz 2019-08-11 20:11:58 UTC
Sorry about the long wait time, compile takes a while on my machine.

When I try to use nvagetbios without arguments, it gives me this message:

No extraction method specified (using -s extraction_method). Autodetecting.
Attempt to extract the vbios from card 0 (nv04) using PRAMIN.
Invalid checksum. Broken vbios or broken retrieval method?
Attempt to extract the vbios from card 0 (nv04) using PROM.
Invalid checksum. Broken vbios or broken retrieval method?
Autodetection failed, aborting.

Therefore I did two runs, one which specified -s prom, the other which specified -s pramin.

However, on both of these runs, it gave the message:

Attempt to extract the vbios from card 0 (nv04) using <PROM/PRAMIN>.
Invalid checksum. Broken vbios or broken retrieval method?
0xff
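
For context on the "Invalid checksum" message: a PCI option ROM image is expected to start with the bytes 0x55 0xAA, carry its length in 512-byte units in byte 2, and have all of its bytes sum to zero modulo 256. The C sketch below shows that check under those assumptions. The helper name rom_checksum_ok is made up for illustration and is not taken from nvagetbios, whose exact validation logic may differ.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns 1 if `rom` looks like a valid option ROM
 * image (0x55 0xAA signature, declared size fits in `len`, and the 8-bit
 * sum of the declared image is zero), otherwise 0.
 */
static int rom_checksum_ok(const uint8_t *rom, size_t len)
{
	size_t size, i;
	uint8_t sum = 0;

	/* Need at least the signature and the size byte. */
	if (len < 3 || rom[0] != 0x55 || rom[1] != 0xaa)
		return 0;

	/* Byte 2 holds the image size in units of 512 bytes. */
	size = (size_t)rom[2] * 512;
	if (size == 0 || size > len)
		return 0;

	/* uint8_t arithmetic wraps, so this is the sum modulo 256. */
	for (i = 0; i < size; i++)
		sum = (uint8_t)(sum + rom[i]);

	return sum == 0;
}
```

A dump that was truncated or read through a broken method would typically fail this test even if the ROM on the card is fine, which is why the tool asks "Broken vbios or broken retrieval method?".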


Thank you for spending your time dealing with this issue.
Comment 7 Jorge Natz 2019-08-11 20:12:31 UTC
Created attachment 145030 [details]
PRAMIN VBIOS dump
Comment 8 Jorge Natz 2019-08-11 20:12:50 UTC
Created attachment 145031 [details]
PROM VBIOS dump
Comment 9 Ilia Mirkin 2019-08-11 20:53:55 UTC
(In reply to Jorge Natz from comment #6)
> When I try to use nvagetbios without arguments, it gives me this message:

Congratulations on having such an old card. Esp one that still works, and a motherboard you can plug it into. From the PRAMIN data, we can see that it's

16MB Diamond Viper TNT AGP Video Card

(which you probably knew already). So nvbios is (mostly) OK with this:

~/src/envytools/nvbios/nvbios pramin.bios 
warning: No strap specified
BIOS size 0x8000 [orig: 0x10000], 1 valid parts:

BIOS part 0 at 0x0 size 0x8000 [init: 0x8800]. Sig:
PCIR [rev 0x00]:
PCI device: 0x10de:0x0020, class 0x030000
Code type 0x00, rev 0x0001
PCIR indicator: 0x80

BIOS type: NV04

Subsystem id: 0x1092:0x5802

BMP 0x00.01 at 0x2df2

Bios version 0x30.2e.8e.7e

(note the straps thing is unrelated to this).

No tables at all decoded by nvbios though. I think that was semi-common though in those days. And the PCIR signature really is at 0x3b6f.
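
As a rough illustration of where the PCIR fields printed above come from, the following C sketch follows the pointer stored at offset 0x18 of the ROM header and decodes a few fields of the PCI data structure. The names pcir_info, rd16 and pcir_parse are hypothetical, and this is not how nvbios or nouveau implement it.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical container for a few decoded PCI data structure fields. */
struct pcir_info {
	uint16_t vendor;
	uint16_t device;
	uint32_t class_code;	/* base class, subclass, prog-if (24 bits) */
	uint8_t code_type;
	uint8_t indicator;	/* bit 7 set means "last image" */
};

/* Read a little-endian 16-bit value. */
static uint16_t rd16(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Locate and decode the "PCIR" structure of the first ROM image.
 * Returns 0 on success, -1 if the structure is missing or does not
 * fit inside the `len` bytes that are available.
 */
static int pcir_parse(const uint8_t *rom, size_t len, struct pcir_info *out)
{
	size_t off;

	if (len < 0x1a)
		return -1;

	/* The ROM header stores the PCIR offset at 0x18 (little-endian). */
	off = rd16(rom + 0x18);
	if (off > len || len - off < 0x18)
		return -1;
	if (memcmp(rom + off, "PCIR", 4) != 0)
		return -1;

	out->vendor = rd16(rom + off + 0x04);
	out->device = rd16(rom + off + 0x06);
	out->class_code = (uint32_t)rom[off + 0x0d] |
			  (uint32_t)rom[off + 0x0e] << 8 |
			  (uint32_t)rom[off + 0x0f] << 16;
	out->code_type = rom[off + 0x14];
	out->indicator = rom[off + 0x15];
	return 0;
}
```

For an image like the one in this bug, a successful parse would be expected to yield vendor 0x10de, device 0x0020, class 0x030000 and indicator 0x80, in line with the nvbios output.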

It's weird since the bytes "PCIR" are definitely there in the dumps at 0x3b6f, however the driver clearly sees a 0 instead of 0x52494350. Oh, that's because it thinks they're out-of-bounds... which in turn looks like it's because we only pre-fetch the first 4K. To make this work, we'd have to fetch the first 16K. Annoying.
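
The effect described here can be sketched in a few lines of C. This is an illustration only, not the nvkm code: struct image and rd32_checked are hypothetical names, and the assumption is that a read outside the fetched window returns 0, which matches the 0 the driver reports.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for a shadowed VBIOS image: `data` points at the
 * ROM contents, but only the first `fetched` bytes are treated as valid.
 */
struct image {
	const uint8_t *data;
	size_t fetched;
};

/*
 * Read a little-endian 32-bit value at `off`. Any access that is not
 * fully inside the fetched window yields 0 (assumed behaviour).
 */
static uint32_t rd32_checked(const struct image *img, size_t off)
{
	if (off > img->fetched || img->fetched - off < 4)
		return 0;

	return (uint32_t)img->data[off] |
	       (uint32_t)img->data[off + 1] << 8 |
	       (uint32_t)img->data[off + 2] << 16 |
	       (uint32_t)img->data[off + 3] << 24;
}
```

With only 0x1000 bytes marked as fetched, rd32_checked(&img, 0x3b6f) returns 0. With 0x4000 bytes fetched and the bytes "PCIR" stored at that offset, it returns 0x52494350, which is "PCIR" read as a little-endian 32-bit value.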

OK, so a super-quick workaround is to change

drivers/gpu/drm/nouveau/nvkm/subdev/bios/shadow.c:shadow_image

                if (!shadow_fetch(bios, mthd, offset + 0x1000)) {

to

                if (!shadow_fetch(bios, mthd, offset + 0x4000)) {

This isn't generally OK - I don't know that all VBIOS's are even that large, but it's OK for these methods.

Ben - what do you think an appropriate workaround is for something like this?
Comment 10 Jorge Natz 2019-08-12 03:25:45 UTC
I applied the workaround you described and recompiled the nouveau module. I will attach the dmesg as well as a VBIOS dump with PRAMIN.
Comment 11 Jorge Natz 2019-08-12 03:26:39 UTC
Created attachment 145034 [details]
Fetch workaround kernel log
Comment 12 Jorge Natz 2019-08-12 03:40:13 UTC
Actually, please ask if you would like another VBIOS dump. Sorry about that.

With the workaround-compiled module, the screen turns black, then a series of messages appear on the console, seeming to be the lines in dmesg that start with "BUG: unable to handle kernel NULL pointer dereference at 00000004" and end with "CR0: 80050033 CR2: 00000004 CR3: 03304000 CR4: 000006d0". The screen then turns black again when X is started and does not revert back to console, although this may just be an incorrect X configuration on my part.
Comment 13 Ilia Mirkin 2019-08-12 03:41:54 UTC
Well that looks much happier. vbios loads fine now.

I wonder whether, without a DCB table, we generate the VGA connector anyway (I think all NV4's had VGA; maybe there was some variant with composite/s-video, but we probably don't support that anyway). I seem to recall code existing where we would auto-generate a connector for that case, but ... I don't remember. Might have been ppc-only.

Next, I'll have to investigate those illegal method errors. Perhaps we're doing something NV5-specific, or the logic has bitrotted. Used to work.

And finally, that null deref, we'll need symbols, otherwise it's hopeless to figure out what happened.
Comment 14 Ilia Mirkin 2019-08-12 06:48:34 UTC
Additionally a log with drm.debug=0x1e would be helpful (with the bios fix) -- I really don't see the code that adds the connectors in when there is no DCB. So either I'm blind (popular option), or that got dropped at some point.
Comment 15 Jorge Natz 2019-08-12 15:24:02 UTC
I was not able to get the symbols for the kernel that hit the null deref, but I compiled the module against a newer kernel version and it worked well, notwithstanding errors in the kernel log. I even got an X session to work with EXA acceleration on the nouveau DDX driver. I will attach a dmesg log from the newer kernel, which does not show a null deref.

Thank you for your help, and I hope I have provided sufficient information to investigate further.
Comment 16 Jorge Natz 2019-08-12 15:24:31 UTC
Created attachment 145039 [details]
Fetch workaround kernel log (new kernel)
Comment 17 Jorge Natz 2019-08-12 15:51:21 UTC
Created attachment 145040 [details]
Fetch workaround kernel log (new kernel)

This log is more representative. Please disregard the previous log.
Comment 18 Ilia Mirkin 2019-08-12 15:59:59 UTC
(In reply to Jorge Natz from comment #17)
> Created attachment 145040 [details]
> Fetch workaround kernel log (new kernel)
> 
> This log is more representative. Please disregard the previous log.

OK great! Looks like we do add the VGA connector somehow ... no clue how given that there's no DCB, but hey - don't mess with success.

I will investigate whether ILLEGAL_MTHD errors are something to worry about. I know some are expected and harmless (if annoying), but not sure about the first clump.

You should have GL 1.2 with nouveau_vieux_dri.so on this too... no HW T&L on this hardware though, only rast. And not extremely conformant (no 3d textures, no clip planes, probably other failings).

So ... question is what to do about the bios load issue. Ben, opinions welcome.
Comment 19 Jorge Natz 2019-08-12 19:11:33 UTC
Given that I know nearly nothing about DRM/VBIOS internals, I am most likely completely wrong, but would the function fabricate_dcb_encoder_table in drivers/gpu/drm/nouveau/nouveau_bios.c be what you were mentioning earlier?
Comment 20 Ilia Mirkin 2019-08-12 19:18:18 UTC
(In reply to Jorge Natz from comment #19)
> Given that I know nearly nothing about DRM/VBIOS internals, I am most likely
> completely wrong, but would the function fabricate_dcb_encoder_table in
> drivers/gpu/drm/nouveau/nouveau_bios.c be what you were mentioning earlier?

Yes, that's exactly it. Of course, it's in our OTHER bios parser, not the one in nvkm. Because you can't have too much of a good thing.

