82152 – [NVE7] NULL deref when putting card back to sleep after unsuccessful init (HUB_INIT timeout)

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/nouveau
Version:	10.2
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Nouveau Project

Reported:	2014-08-04 21:28 UTC by Patrick Burroughs
Modified:	2019-09-18 20:39 UTC
CC List:	2 users

Attachments
Full output of journalctl, including kernel logs, between system boot and poweroff after crash. (165.12 KB, text/plain) 2014-08-04 21:28 UTC, Patrick Burroughs
Output from crashing glxinfo. (15.55 KB, text/plain) 2014-08-04 21:29 UTC, Patrick Burroughs
Kernel logs filtered from journal output. (44.22 KB, text/plain) 2014-08-04 21:29 UTC, Patrick Burroughs
Xorg.0.log from crash. (26.28 KB, text/plain) 2014-08-04 21:30 UTC, Patrick Burroughs
dmesg output post initial patch. (72.35 KB, text/plain) 2014-08-04 22:27 UTC, Patrick Burroughs
dmesg output using DRI3 (73.40 KB, text/plain) 2014-08-05 00:31 UTC, Patrick Burroughs
dmesg output using firmware ripped from the blob (72.81 KB, text/plain) 2014-08-05 01:28 UTC, Patrick Burroughs
dmesg output with errors from successful load (11.51 KB, text/plain) 2016-04-16 12:16 UTC, Patrick Burroughs

Description Patrick Burroughs 2014-08-04 21:28:04 UTC

Created attachment 104017 [details]
Full output of journalctl, including kernel logs, between system boot and poweroff after crash.

Any OpenGL application, even one as minor as glxinfo, either crashes Xorg or locks up the machine entirely (no network, magic sysrq fails) when started with DRI_PRIME=1. This has happened across multiple Mesa and kernel versions, most recently with Mesa 10.2.4 and Linux 3.15.8 on Arch Linux.
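
For anyone trying to reproduce this, a minimal sketch of launching glxinfo with DRI_PRIME=1 from a script might look like the following. The helper names prime_env and run_glxinfo are hypothetical, and the sketch assumes glxinfo is on PATH. On an affected machine, calling run_glxinfo may crash Xorg or hang the system as described above, so it is deliberately not called on import.

```python
import os
import subprocess


def prime_env(base_env=None):
    """Return a copy of the environment with DRI_PRIME=1 set.

    The input mapping is copied, never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["DRI_PRIME"] = "1"
    return env


def run_glxinfo(timeout=30):
    """Run glxinfo on the offloaded GPU and return the CompletedProcess.

    Assumes a glxinfo binary is installed; raises FileNotFoundError otherwise.
    """
    return subprocess.run(
        ["glxinfo"],
        env=prime_env(),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

This is equivalent to running "DRI_PRIME=1 glxinfo" from a shell; the timeout only helps if the process hangs without taking the machine down with it.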

Comment 1 Patrick Burroughs 2014-08-04 21:29:12 UTC

Created attachment 104018 [details]
Output from crashing glxinfo.

Comment 2 Patrick Burroughs 2014-08-04 21:29:56 UTC

Created attachment 104019 [details]
Kernel logs filtered from journal output.

Comment 3 Patrick Burroughs 2014-08-04 21:30:34 UTC

Created attachment 104020 [details]
Xorg.0.log from crash.

Comment 4 Ilia Mirkin 2014-08-04 21:41:28 UTC

There are two issues:

(a) The null deref in the kernel when putting the card back to sleep
(b) The fact that init of the card fails

To mitigate the first, you could boot with "nouveau.runpm=0". However, you still wouldn't get working acceleration with nouveau.
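
To confirm that the runpm workaround actually reached the kernel after a reboot, something like the sketch below could be used. It only parses the kernel command line text, as exposed in /proc/cmdline on Linux. The function names kernel_param and runpm_disabled are hypothetical, and the parser is naive: it splits on whitespace and does not handle quoted values.

```python
from pathlib import Path


def kernel_param(cmdline, name):
    """Return the value of the last `name=value` token on a kernel command line.

    Returns None when the parameter does not appear at all.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val  # keep scanning so the last occurrence is reported
    return value


def runpm_disabled(cmdline):
    """True if the command line carries nouveau.runpm=0."""
    return kernel_param(cmdline, "nouveau.runpm") == "0"


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")  # Linux only
    if proc_cmdline.exists():
        print("nouveau.runpm=0 present:", runpm_disabled(proc_cmdline.read_text()))
```

This checks only what was passed at boot, not whether the driver honoured it.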

The claim by NVIDIA was that the graph-not-powered-up problem was restricted to GK104/GK106. But looking at the latest code, it seems like it runs on GK107 as well (not in Ben's repo anymore, but still in linux-3.16) and perhaps has the reverse effect there.

I wonder if a patch like

diff --git a/nvkm/engine/graph/nve4.c b/nvkm/engine/graph/nve4.c
index 51e0c07..4dd376e 100644
--- a/nvkm/engine/graph/nve4.c
+++ b/nvkm/engine/graph/nve4.c
@@ -350,7 +350,7 @@ nve4_graph_oclass = &(struct nvc0_graph_oclass) {
                .ctor = nvc0_graph_ctor,
                .dtor = nvc0_graph_dtor,
                .init = nve4_graph_init,
-               .fini = nve4_graph_fini,
+               .fini = _nouveau_graph_fini,
        },
        .cclass = &nve4_grctx_oclass,
        .sclass = nve4_graph_sclass,

will help you out. (You'll need to apply it with care: cd into drivers/gpu/drm/nouveau/core and apply it with patch -p2.)

Comment 5 Patrick Burroughs 2014-08-04 22:26:42 UTC

I get the same crash and HUB_INIT timeout after the patch. Attaching dmesg.

Comment 6 Patrick Burroughs 2014-08-04 22:27:05 UTC

Created attachment 104026 [details]
dmesg output post initial patch.

Comment 7 Tobias Klausmann 2014-08-04 22:36:38 UTC

If I look at the system and the kernel bug, this looks similar to a problem I was facing some weeks ago, so I'd suggest trying DRI3 with the whole package:

Update your packages:
xf86-video-intel
mesa
(all dependencies of course)
Remove:
xf86-video-nouveau (with DRI3 you won't need it to do: DRI_PRIME=1 myprog)


You'll need a kernel with render nodes enabled (boot with drm.rnodes=1).
You may need to add a file to /etc/udev/rules.d/ containing:

SUBSYSTEM=="drm", IMPORT{builtin}="path_id" 

to get ID_PATH tags for rendernodes.
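
A quick way to check that render nodes are present after booting with drm.rnodes=1 is to look for renderD* entries under /dev/dri. The sketch below does only that; the function names is_render_node and render_nodes are hypothetical, and it does not verify the ID_PATH tags set by the udev rule.

```python
from pathlib import Path


def is_render_node(name):
    """True for device names of the form renderD<number>, e.g. renderD128."""
    prefix = "renderD"
    return name.startswith(prefix) and name[len(prefix):].isdigit()


def render_nodes(dri_dir="/dev/dri"):
    """Return the sorted render node names in dri_dir, or [] if it is missing."""
    root = Path(dri_dir)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.iterdir() if is_render_node(p.name))


if __name__ == "__main__":
    nodes = render_nodes()
    print("render nodes:", ", ".join(nodes) if nodes else "none found")
```

An empty result suggests the kernel was not booted with render nodes enabled. With two GPUs one would expect two entries, though that is an assumption about this particular setup.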

Comment 8 Patrick Burroughs 2014-08-05 00:31:19 UTC

Created attachment 104031 [details]
dmesg output using DRI3

Using DRI3 defers all errors until after attempting to run an OpenGL application with DRI_PRIME=1, and prevents the crash from bringing down X or the kernel.

Comment 9 Patrick Burroughs 2014-08-05 01:28:20 UTC

Created attachment 104033 [details]
dmesg output using firmware ripped from the blob

Using DRI3 and the firmware from the blob I still get crashes, but finally have a different error message in dmesg.

Comment 10 Patrick Burroughs 2015-04-06 07:41:43 UTC

Tried again with Linux 3.19.3 and Mesa 10.5.2, no changes.

Comment 11 Patrick Burroughs 2016-04-16 12:16:38 UTC

Created attachment 122992 [details]
dmesg output with errors from successful load

With Linux 4.5.0-ARCH, Mesa 11.1.2-3, DRI3, and using modesetting_drv instead of intel_drv for the main display (not sure if that's relevant)... everything works! (If "everything" consists of glxinfo, glxgears, and a few minutes of Darwinia.)

I do still get errors in dmesg, though, as attached. I'll be happy to follow along and do whatever digging is necessary to eradicate them, if someone wants to take up that task.

Comment 12 GitLab Migration User 2019-09-18 20:39:46 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1066.
