Created attachment 117481 [details]
dmesg log with enabled nouveau debug
When the kernel tries to load the nouveau kernel module during boot, it fails with "failed to parse ramcfg data" (see attached logs). The same happens when I try to load the module manually later. The regression appeared between kernel 3.17 and 3.18: downgrading to 3.17 fixes the problem, and no later version works. I'm using an NVIDIA GT 645M (GK107) on a Lenovo Z500 with an Intel i5-3230 processor. OS: tested on Arch Linux (both the stock and -ck kernels) and Fedora 22. Let me know if any other info is needed.
[ 3.766892] nouveau D[ PFB][0000:01:00.0] 0x100800: 0x00000002
[ 3.766892] nouveau D[ PFB][0000:01:00.0] parts 0x00000002 mask 0x00000000
[ 3.766900] nouveau D[ PFB][0000:01:00.0] 0: mem_amount 0x00000400
[ 3.766902] nouveau D[ PFB][0000:01:00.0] 1: mem_amount 0x00000400
[ 3.766913] nouveau E[ PFB][0000:01:00.0] failed to parse ramcfg data
[ 3.766914] nouveau E[ PFB][0000:01:00.0] failed to create 0x00000000, -22
[ 3.766915] nouveau ![ PFB][0000:01:00.0] error detecting memory configuration!!
[ 3.766916] nouveau E[ DEVICE][0000:01:00.0] failed to create 0x1000e00b, -22
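For readers decoding the log above: the -22 in the "failed to create" lines is a negated errno value, and on Linux 22 is EINVAL ("Invalid argument"). A minimal user-space sketch of that mapping follows; the helper name kernel_err_string is made up for illustration and is not part of nouveau.

```c
#include <errno.h>
#include <string.h>

/* Hypothetical helper (not a nouveau function): turn a negative
 * kernel-style return code such as -22 into its errno message. */
static const char *kernel_err_string(int ret)
{
    /* Kernel code returns -errno, so negate before the lookup. */
    return strerror(-ret);
}
```

Calling kernel_err_string(-22) on a Linux system yields the text for EINVAL, which matches the "failed to parse ramcfg data" path rejecting the table contents.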
Could you attach your vbios.rom from a successful boot (/sys/kernel/debug/dri/1/vbios.rom)?
Created attachment 117483 [details]
It seems that the code that triggers this bug is in drivers/gpu/drm/nouveau/nvkm/subdev/fb/ramgk104.c, around line 1583. The assumption in the comment there does not seem to hold: on my card this condition clearly doesn't mean that the card should be ignored completely. Maybe it has some other meaning. Does anybody know how I can help debug the problem? Here is the comment from the code:
/* parse bios data for all rammap table entries up-front, and
 * build information on whether certain fields differ between
 * any of the entries.
 *
 * the binary driver appears to completely ignore some fields
 * when all entries contain the same value. at first, it was
 * hoped that these were mere optimisations and the bios init
 * tables had configured as per the values here, but there is
 * evidence now to suggest that this isn't the case and we do
 * need to treat this condition as a "don't touch" indicator.
 */
I tested a kernel build with the relevant code commented out (the code right after the comment I posted), and indeed the kernel module works: it loads, and DRI_PRIME=1 allows me to run glxgears/glxinfo. What is more, I now get better power savings, because with the driver loaded vgaswitcheroo can turn off the NVIDIA card.
I still get this error in 4.14.1. Same laptop, same vbios.rom. If I comment out the "return ret" at https://github.com/torvalds/linux/blob/master/drivers/gpu/drm/nouveau/nvkm/subdev/fb/ramgk104.c#L1577, it boots and runs 3D acceleration with xrandr offload, but cannot reclock successfully.
Looking at envytools' nvbios decode, given ramcfg=1, the first entry of the timing mapping table (0x718f) has a value of 0xf at index 1, offset 0. This is taken as index into the timing table (0x733c); however, the timing table only defines 12 entries, and only six of them are nonzero.
If I simply skip entry 0, reclocking works fine. Maybe that's what 0x0f indicates?
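The workaround described above amounts to skipping any timing-map entry whose timing index points outside the timing table, rather than failing the whole parse. A rough stand-alone sketch of that idea is below. It is not the nouveau code: the function names, the flat array of indices, and the TIMING_TABLE_ENTRIES constant (12, taken from the nvbios decode mentioned above) are all assumptions for illustration, and whether 0x0f really means "skip this entry" is still only a guess.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumption: the timing table in this vbios defines 12 entries. */
#define TIMING_TABLE_ENTRIES 12

/* Hypothetical check: true when a timing index refers to an entry that
 * actually exists in the timing table. 0x0f fails for a 12-entry table. */
static bool timing_index_valid(uint8_t index, size_t table_entries)
{
    return index < table_entries;
}

/* Hypothetical helper: record the positions of the map entries whose
 * timing index is in range, and return how many were kept. Out-of-range
 * entries are skipped instead of aborting with an error. */
static size_t collect_usable_entries(const uint8_t *timing_index, size_t count,
                                     size_t table_entries, size_t *usable)
{
    size_t kept = 0;

    assert(timing_index != NULL && usable != NULL);

    for (size_t i = 0; i < count; i++) {
        if (!timing_index_valid(timing_index[i], table_entries))
            continue; /* e.g. entry 0 with index 0x0f on this card */
        usable[kept++] = i;
    }
    return kept;
}
```

With indices {0x0f, 0x00, 0x01} and a 12-entry table, only positions 1 and 2 would be kept, which mirrors "skip entry 0" on this laptop.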