Bug 101322 - GM108/NV118: 0 MiB DDR3 and boot crash in gf100_ltc_oneinit_tag_ram
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
Reported: 2017-06-06 20:30 UTC by Daniel Drake
Modified: 2017-10-20 08:33 UTC (History)

Attachments
full dmesg log (53.37 KB, text/x-log), 2017-06-06 20:30 UTC, Daniel Drake
dmesg with nouveau.debug=debug (275.53 KB, text/plain), 2017-06-07 13:10 UTC, Daniel Drake
test disabling pci link config twiddling (666 bytes, patch), 2017-06-14 00:40 UTC, Ben Skeggs

Description Daniel Drake 2017-06-06 20:30:04 UTC
Created attachment 131742 [details]
full dmesg log

On the Acer Z20-730 laptop, the nouveau driver crashes during boot with:

[    4.041108] nouveau 0000:01:00.0: pci: failed to adjust cap speed
[    4.041167] nouveau 0000:01:00.0: pci: failed to adjust lnkctl speed
[    9.633613] nouveau 0000:01:00.0: fb: 0 MiB DDR3
[   20.811768] divide error: 0000 [#1] SMP
[   20.813654] Modules linked in: hid_generic usbmouse usbkbd usbhid i915 nouveau(+) mxm_wmi i2c_algo_bit drm_kms_helper sdhci_pci syscopyarea sysfillrect sysimgblt fb_sys_fops ttm sdhci drm ahci libahci wmi i2c_hid hid video
[   20.815697] CPU: 3 PID: 200 Comm: systemd-udevd Not tainted 4.11.0-2-generic #7+dev143.9f9ecd2beos3.2.2-Endless
[   20.817684] Hardware name: Acer Aspire Z20-730/IPMAL-BR3, BIOS D01 07/07/2016
[   20.819711] task: ffff8a3070288000 task.stack: ffffa3eb4103c000
[   20.821762] RIP: 0010:gf100_ltc_oneinit_tag_ram+0xba/0x100 [nouveau]
[   20.823789] RSP: 0018:ffffa3eb4103f6b8 EFLAGS: 00010206
[   20.825773] RAX: 00001000ffefdfff RBX: ffff8a306f915000 RCX: ffff8a3075570030
[   20.827820] RDX: 0000000000000000 RSI: dead000000000200 RDI: ffff8a307fd9b700
[   20.829797] RBP: ffffa3eb4103f6d0 R08: 000000000001e980 R09: ffff8a3077003900
[   20.831825] R10: ffffa3eb40cdbda0 R11: ffff8a307fd986a4 R12: 0000000000000000
[   20.833814] R13: 0000000100005fff R14: ffff8a306fa2e400 R15: ffff8a306f914e00
[   20.835882] FS:  00007f456d052900(0000) GS:ffff8a307fd80000(0000) knlGS:0000000000000000
[   20.837874] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   20.839935] CR2: 00007fefceb1c020 CR3: 000000026fbc5000 CR4: 00000000003406e0
[   20.841918] Call Trace:
[   20.843972]  gm107_ltc_oneinit+0x7c/0x90 [nouveau]
[   20.845952]  nvkm_ltc_oneinit+0x13/0x20 [nouveau]
[   20.847991]  nvkm_subdev_init+0x50/0x210 [nouveau]
[   20.849977]  nvkm_device_init+0x151/0x270 [nouveau]
[   20.851997]  nvkm_udevice_init+0x48/0x60 [nouveau]
[   20.853944]  nvkm_object_init+0x40/0x190 [nouveau]
[   20.855924]  nvkm_ioctl_new+0x179/0x290 [nouveau]
[   20.857838]  ? nvkm_client_notify+0x30/0x30 [nouveau]
[   20.859794]  ? nvkm_udevice_rd08+0x30/0x30 [nouveau]
[   20.861674]  nvkm_ioctl+0x168/0x240 [nouveau]
[   20.863576]  ? nvif_client_init+0x42/0x110 [nouveau]
[   20.865449]  nvkm_client_ioctl+0x12/0x20 [nouveau]
[   20.867368]  nvif_object_ioctl+0x42/0x50 [nouveau]
[   20.869237]  nvif_object_init+0xc2/0x130 [nouveau]
[   20.871141]  nvif_device_init+0x12/0x30 [nouveau]
[   20.872994]  nouveau_cli_init+0x15e/0x1d0 [nouveau]
[   20.874873]  nouveau_drm_load+0x67/0x8b0 [nouveau]
[   20.876674]  ? sysfs_do_create_link_sd.isra.2+0x70/0xb0
[   20.878451]  drm_dev_register+0x148/0x1e0 [drm]
[   20.880302]  drm_get_pci_dev+0xa0/0x160 [drm]
[   20.882166]  nouveau_drm_probe+0x1d9/0x260 [nouveau]

This has been reproduced on 4.12-rc3. Please let us know how we can help with further debugging.
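
For anyone triaging a similar report, the symptoms in the log above can be pulled out mechanically. What follows is only a minimal sketch in Python: `summarize` and its two regular expressions are hypothetical helpers written for illustration, not part of nouveau or envytools, and they assume nothing beyond the dmesg line formats quoted in this report.

```python
import re

# Hypothetical triage helper; the patterns assume only the dmesg line
# formats quoted in this report.
FB_RE = re.compile(r"nouveau [0-9a-f:.]+: fb: (\d+) MiB (\S+)")
RIP_RE = re.compile(r"RIP: [0-9a-f]+:(\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]")


def summarize(log):
    """Collect the reported VRAM size and the faulting function from dmesg text."""
    summary = {
        "vram_mib": None,
        "vram_type": None,
        "divide_error": False,
        "fault_function": None,
        "fault_module": None,
    }
    for line in log.splitlines():
        fb = FB_RE.search(line)
        if fb:
            summary["vram_mib"] = int(fb.group(1))
            summary["vram_type"] = fb.group(2)
        if "divide error" in line:
            summary["divide_error"] = True
        rip = RIP_RE.search(line)
        # Keep the first RIP line only; it names the faulting function.
        if rip and summary["fault_function"] is None:
            summary["fault_function"] = rip.group(1)
            summary["fault_module"] = rip.group(2)
    return summary


# Three lines copied from the log in this report.
SAMPLE = """\
[    9.633613] nouveau 0000:01:00.0: fb: 0 MiB DDR3
[   20.811768] divide error: 0000 [#1] SMP
[   20.821762] RIP: 0010:gf100_ltc_oneinit_tag_ram+0xba/0x100 [nouveau]
"""

result = summarize(SAMPLE)
print(result)
```

A `vram_mib` of 0 together with `divide_error` set is the combination this report describes; the sketch only detects it and says nothing about the cause.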
Comment 1 Karol Herbst 2017-06-06 20:52:26 UTC
(In reply to Daniel Drake from comment #0)

Does this error appear on older kernels?

Can you get a dmesg from a boot with "nouveau.debug=debug"?

Thanks
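
Before capturing the log, it may be worth confirming that the option actually reached the kernel. The sketch below assumes the option is passed on the kernel command line (readable from /proc/cmdline on Linux); `kernel_param` is a hypothetical helper, and the sample command line is made up for illustration.

```python
import shlex


def kernel_param(cmdline, name):
    """Return the value given for `name` on a kernel command line.

    Returns None if the parameter is absent and "" if it appears without
    a value. If it is repeated, the last occurrence is returned.
    """
    found = None
    for token in shlex.split(cmdline):
        key, sep, value = token.partition("=")
        if key == name:
            found = value if sep else ""
    return found


# Illustrative command line; on a real system read the text of /proc/cmdline.
sample_cmdline = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 quiet nouveau.debug=debug"
print(kernel_param(sample_cmdline, "nouveau.debug"))
```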
Comment 2 Daniel Drake 2017-06-07 13:10:06 UTC
Created attachment 131772 [details]
dmesg with nouveau.debug=debug

Here is the debug log. By the way, this is exactly the same report as the one posted on the nouveau list under "Kernel panic on nouveau during boot on NVIDIA NV118 (GM108)". I duplicated it here after getting no response on the list (and I think this is the more appropriate place?).

Thanks for your help so far!
Comment 3 Daniel Drake 2017-06-07 13:11:21 UTC
And yes, this problem has been reproduced on v4.8, v4.11 and v4.12-rc3. We don't know of any working kernels.
Comment 4 Karol Herbst 2017-06-07 13:33:24 UTC
Can you also upload your vbios.rom file, located at /sys/kernel/debug/dri/0/vbios.rom? And if you are up for it, install envytools and run nvapeek 101000 as root. The second step is optional, but may help us even more.
Comment 5 Daniel Drake 2017-06-08 14:17:00 UTC
vbios.rom is empty. We will try to get envytools running now.
Comment 6 Daniel Drake 2017-06-13 15:02:26 UTC
The nvapeek 101000 output is "PCI init failure".
Comment 7 Ben Skeggs 2017-06-14 00:40:34 UTC
Created attachment 131939 [details] [review]
test disabling pci link config twiddling

Can you give this patch a try?  It looks like things are working normally up until we try to fiddle with the PCIE link configuration, and I'd like to rule this in/out as a culprit.
Comment 8 Daniel Drake 2017-08-17 08:00:09 UTC
Sorry for the slow response.
We tested the patch against 4.13-rc5 and the issue is still there.
Comment 9 Daniel Drake 2017-10-12 08:31:11 UTC
Problem still present on 4.14-rc4.
Comment 10 Ben Skeggs 2017-10-12 20:11:26 UTC
I don't suppose you'd be able to grab a mmiotrace[1] of the proprietary driver for me?  One of Nouveau might be useful also.

[1] https://nouveau.freedesktop.org/wiki/MmioTrace/
Comment 11 Daniel Drake 2017-10-18 08:00:04 UTC
Sent partial dump from nvidia proprietary driver to mmio.dumps address:


I loaded the module and then started an empty X session.
Unfortunately with tracing enabled, this results in an instant hard hang
upon starting X, before it has rendered the black screen.

However I managed to capture these messages over the network up until
the point of hang.

Captured on Linux 4.13. No external displays connected. The all-in-one PC
has one internal LCD display.
Comment 12 Ben Skeggs 2017-10-18 19:07:17 UTC
This patch should (hopefully) help with the issues faced while tracing the binary driver. It applies to their kernel shim layer source code, which can be found by extracting the installer package with --extract-only. After applying it, run ./nvidia-installer inside the extracted directory to build the patched kernel module.

https://paste.fedoraproject.org/paste/LkiC1cJdfPGOLc~NlKWkcA
Comment 13 Daniel Drake 2017-10-20 08:33:32 UTC
That worked. Sent an updated dump.