Bug 105174 - [GF108][Regression] Unable to handle NULL pointer dereference in nouveau_mem_host since kernel 4.15.3
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: unspecified
Hardware: x86-64 (AMD64) All
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
Reported: 2018-02-20 11:06 UTC by Dominik 'Rathann' Mierzejewski
Modified: 2019-06-01 06:07 UTC

Attachments
full dmesg (77.47 KB, text/plain), 2018-02-20 11:06 UTC, Dominik 'Rathann' Mierzejewski
Log from journalctl (14.45 KB, text/plain), 2018-02-22 15:18 UTC, Philip Raets
dmesg from hang on Hugh's system (74.18 KB, text/plain), 2018-02-24 20:59 UTC, D. Hugh Redelmeier
Proposed patch (1.00 KB, patch), 2018-03-01 18:22 UTC, Pierre Moreau
Bootlog patched kernel 4.15.7 (53.28 KB, text/plain), 2018-03-05 12:34 UTC, Philip Raets
Log from patched opensuse (5.36 KB, text/x-log), 2018-03-15 10:02 UTC, Philip Raets
dmesg with kernel-4.15.11-300.fc27 (4.72 KB, text/plain), 2018-03-23 09:30 UTC, Stefano Biagiotti
journalctl (kernel-4.15.12-301.fc27.x86_64 (4.66 KB, text/plain), 2018-03-26 09:47 UTC, Stefano Biagiotti

Description Dominik 'Rathann' Mierzejewski 2018-02-20 11:06:05 UTC
Created attachment 137463 [details]
full dmesg

After updating to Fedora kernel 4.15.3-300.fc27.x86_64, I got:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
IP: nouveau_mem_host+0x47/0x1b0 [nouveau]

Full dmesg attached. The hardware is Dell XPS 15 with Intel and nVidia GPU in Optimus configuration.

Note: this looks different from bug 105173.
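
For readers comparing oops signatures across reports, here is a minimal Python sketch that splits an "IP:" line of the form shown above into symbol, offset, function size and module. The regular expression and the name parse_oops_ip are illustrative assumptions, not part of any kernel tooling, and the pattern only covers the 4.15-era "IP: symbol+offset/size [module]" layout quoted in this report.

```python
import re

# Hypothetical helper: matches lines such as
# "IP: nouveau_mem_host+0x47/0x1b0 [nouveau]".
OOPS_IP_RE = re.compile(
    r"IP:\s+(?P<symbol>[\w.]+)"
    r"\+(?P<offset>0x[0-9a-fA-F]+)"
    r"/(?P<size>0x[0-9a-fA-F]+)"
    r"(?:\s+\[(?P<module>[\w-]+)\])?"
)


def parse_oops_ip(line):
    """Return the faulting symbol, offset, size and module, or None if no match."""
    match = OOPS_IP_RE.search(line)
    if match is None:
        return None
    return {
        "symbol": match.group("symbol"),
        # Offset of the faulting instruction from the start of the function.
        "offset": int(match.group("offset"), 16),
        # Total size of the function in bytes.
        "size": int(match.group("size"), 16),
        # Module name is absent for symbols built into the kernel image.
        "module": match.group("module"),
    }


if __name__ == "__main__":
    print(parse_oops_ip("IP: nouveau_mem_host+0x47/0x1b0 [nouveau]"))
```

Two reports with the same symbol and offset on the same kernel build very likely fault at the same instruction, which is how later commenters matched their logs to this bug.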
Comment 1 Dominik 'Rathann' Mierzejewski 2018-02-20 11:15:40 UTC
After logging in I get no output on the second screen attached to HDMI port and the Xorg session doesn't start fully. I can only see the wallpaper on the built-in display. Mouse cursor moves, but doesn't respond to clicks. Machine remains accessible via ssh. There are no errors or warnings in Xorg log.
Comment 2 Dominik 'Rathann' Mierzejewski 2018-02-20 11:16:11 UTC
4.14.18-300.fc27.x86_64 is the last working Fedora kernel.
Comment 3 Pierre Moreau 2018-02-20 11:20:04 UTC
Thank you for the bug report.
Is this easily reproducible (and if yes, how)? Otherwise, what were you doing when this happened?
Comment 4 Dominik 'Rathann' Mierzejewski 2018-02-20 16:47:24 UTC
It happens on every login. My user display configuration is such that it switches the HDMI output on and internal LCD off, while the login screen (lightdm) uses the internal LCD only. Immediately after logging in, I get the display freeze when it tries to drive the HDMI output. The mouse cursor is still moving.

Apparently the internal LCD is driven by i915 directly while the HDMI output goes via nouveau.
Comment 5 Philip Raets 2018-02-22 15:18:39 UTC
Created attachment 137532 [details]
Log from journalctl
Comment 6 Philip Raets 2018-02-22 15:19:29 UTC
I have the same error in the kernel with the nouveau driver on openSUSE Tumbleweed (kernel 4.15.3 and 4.15.4). This is on a desktop (Dell Optiplex 790 with an NVIDIA GT218 (NVS300) dual display)

They seem to happen at random in my system. I can't pinpoint an action when this occurs. 

Attached the logs from journalctl

My bugreport on openSUSE:
https://bugzilla.opensuse.org/show_bug.cgi?id=1082308

Greetings,

Philip
Comment 7 D. Hugh Redelmeier 2018-02-24 20:55:52 UTC
I got here by googling for "IP: nouveau_mem_host+0x47/0x1b0 [nouveau]".  This leads me to think that my problem is (partly) this problem.

As I type this (using a different computer) my screen is hung.  But my computer is working.

The machine is running Fedora 27 with all updates (except the latest proprietary nvidia driver, which fails).  The kernel is 4.15.4-300.fc27.x86_64

I'm intending that the nouveau driver be suppressed in favour of the nvidia driver.  Historically (i.e. for about 3 years) nouveau didn't work on this setup (GTX 650 driving a Seiki UltraHD monitor at 30Hz).

For some reason, nouveau is running and the nvidia driver is not.  Even though I have these kernel parameters:
rd.driver.blacklist=nouveau modprobe.blacklist=nouveau nvidia-drm.modeset=1
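
A rough Python sketch of how one might check whether a module named in such blacklist parameters is loaded anyway. The function names are hypothetical; the only assumptions are that blacklist values are comma-separated module names and that the first field of each /proc/modules line is the module name.

```python
def blacklisted_modules(cmdline):
    """Return the modules named by blacklist parameters in a kernel command line."""
    keys = ("modprobe.blacklist", "rd.driver.blacklist")
    found = set()
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in keys:
            # Both parameters accept a comma-separated list of module names.
            found.update(name for name in value.split(",") if name)
    return found


def loaded_modules(proc_modules_text):
    """Return module names from text in /proc/modules format."""
    return {
        line.split()[0]
        for line in proc_modules_text.splitlines()
        if line.strip()
    }


def blacklisted_but_loaded(cmdline, proc_modules_text):
    """Return modules that are blacklisted on the command line yet still loaded."""
    return blacklisted_modules(cmdline) & loaded_modules(proc_modules_text)
```

On a live system the two inputs would come from reading /proc/cmdline and /proc/modules. A non-empty result would only confirm the symptom described here; it does not explain why the blacklist was not honoured.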

In any case, nouveau is now running, and I can log in, but when I start firefox (with a LOT of tabs), nouveau dereferences a NULL and the screen freezes (but not the mouse).

I seem to remember that when I tried nouveau in the past, it would hang in a similar way.  My hypothesis (untested) was that my large number of firefox tabs would exhaust some nouveau resource.  I did not have that problem with the nvidia proprietary driver.

So I have multiple problems, but one of them is this nouveau bug.  I don't expect a solution to my other problems to come up in this bz.

I will attach a dmesg.

PS: why is the status NEEDSINFO?  I don't see where there is an outstanding request for info.  I will try to change the status to NEW.
Comment 8 D. Hugh Redelmeier 2018-02-24 20:59:09 UTC
Created attachment 137583 [details]
dmesg from hang on Hugh's system
Comment 9 Pierre Moreau 2018-03-01 18:22:26 UTC
Created attachment 137731 [details] [review]
Proposed patch

Could anyone please try the attached patch (from https://github.com/skeggsb/nouveau/pull/1)?

(In reply to D. Hugh Redelmeier from comment #7)
> PS: why is the status NEEDSINFO?  I don't see where there is an outstanding
> request for info.  I will try to change the status to NEW.

Simply because no one changed the status since the information was provided. :-)
Comment 10 Philip Raets 2018-03-05 12:34:44 UTC
Created attachment 137793 [details]
Bootlog patched kernel 4.15.7

Hi,

I've tried a patched kernel provided by Takashi Iwai on openSUSE (see http://bugzilla.opensuse.org/show_bug.cgi?id=1082308 )

But then my system would crash at startup.

I have attached my boot log from that kernel.
Comment 11 Dominik 'Rathann' Mierzejewski 2018-03-09 19:51:01 UTC
(In reply to Pierre Moreau from comment #9)
> Created attachment 137731 [details] [review]
> Proposed patch
> 
> Could anyone please try the attached patch (from
> https://github.com/skeggsb/nouveau/pull/1)?

I can confirm that the patch fixes the bug for me when applied to Fedora kernel (4.15.7-300.fc27 tested this time). Thanks!
Comment 13 Philip Raets 2018-03-15 10:02:27 UTC
Created attachment 138127 [details]
Log from patched opensuse

I've installed a patched kernel for opensuse (details: https://bugzilla.opensuse.org/show_bug.cgi?id=1082308)

But the problem still occurs when opening JPGs like https://www.dropbox.com/s/gex21o67q31aytx/PRP_5808.jpg?dl=0

These are JPGs that I exported from Darktable.

Attached is the error from journalctl.

I have to log in through ssh with my phone and then I can force a reboot. (The graphics freeze; only the cursor is working.)
Comment 14 Pierre Moreau 2018-03-17 14:54:52 UTC
FYI, the attached patch has been submitted along with fixes to DRM. It doesn’t look like it has landed yet, but it might be part of 4.16-rc6.

(In reply to Philip Raets from comment #13)
> Created attachment 138127 [details]
> Log from patched opensuse
> 
> I've installed a patched kernel for opensuse (details:
> https://bugzilla.opensuse.org/show_bug.cgi?id=1082308)
> 
> But the problem still occurs when opening JPGs like
> https://www.dropbox.com/s/gex21o67q31aytx/PRP_5808.jpg?dl=0
> 
> These are JPGs that I exported from Darktable.
> 
> Attached is the error from journalctl.
> 
> I have to log in through ssh with my phone and then I can force a reboot.
> (The graphics freeze; only the cursor is working.)

Since the patch works for the bug report author but not for you, I think you are experiencing another (or an additional) issue. Please open a separate bug report.
Comment 15 Pierre Moreau 2018-03-17 14:56:13 UTC
And thank you all for your replies and trying out the patch :-)
Comment 16 Dominik 'Rathann' Mierzejewski 2018-03-19 09:35:48 UTC
(In reply to Pierre Moreau from comment #15)
> And thank you all for your replies and trying out the patch :-)

Thank you for the quick fix!
Comment 17 Stefano Biagiotti 2018-03-23 09:30:18 UTC
Created attachment 138305 [details]
dmesg with kernel-4.15.11-300.fc27

I am on Fedora 27 x86_64 with MATE Desktop Environment.
The fix posted here has been included in the new package kernel-4.15.11-300.fc27.x86_64.rpm. (Source: https://bugzilla.redhat.com/show_bug.cgi?id=1547037 )

I installed kernel-4.15.11-300.fc27 from the updates-testing repository, but it doesn't resolve the problem.
 $ LANG=en dnf list kernel-4.15.11-300.fc27
 Failed to set locale, defaulting to C
 Last metadata expiration check: 0:00:15 ago on Thu Mar 22 15:50:44 2018.
 Installed Packages
 kernel.x86_64           4.15.11-300.fc27                @updates-testing

Same freeze after login via lightdm.
Dmesg attached.
Comment 18 Stefano Biagiotti 2018-03-26 09:47:26 UTC
Created attachment 138354 [details]
journalctl (kernel-4.15.12-301.fc27.x86_64

Fedora kernel-4.15.12-301.fc27.x86_64 from the updates-testing repository still doesn't resolve the problem.

Display adapter (from lspci) is:
01:00.0 VGA compatible controller: NVIDIA Corporation G98 [GeForce 8400 GS Rev. 2] (rev a1)

Attached is an excerpt from "journalctl -k -b -1 --no-pager --no-hostname".
Comment 19 Pierre Moreau 2018-03-26 11:01:47 UTC
(In reply to Stefano Biagiotti from comment #18)
> Created attachment 138354 [details]
> journalctl (kernel-4.15.12-301.fc27.x86_64
> 
> Fedora kernel-4.15.12-301.fc27.x86_64 from the updates-testing repository
> still doesn't resolve the problem.
> 
> Display adapter (from lspci) is:
> 01:00.0 VGA compatible controller: NVIDIA Corporation G98 [GeForce 8400 GS
> Rev. 2] (rev a1)
> 
> Attached is an excerpt from "journalctl -k -b -1 --no-pager --no-hostname".

The patch was reported as working by the person who opened this bug report, so I am changing this bug report back to fixed. Since it does not seem to be the case for you (and you are using a GPU from a different family, Tesla vs Fermi), you should open a different bug report.
There have been other reports of the patch not being enough on another Tesla card (though on a different chipset): you might want to look at https://bugs.freedesktop.org/show_bug.cgi?id=105626 and https://bugs.freedesktop.org/show_bug.cgi?id=105687.

Also, please try to avoid posting excerpts of logs: there can be other errors happening before this NULL pointer dereference, and seeing the different messages output by Nouveau during its initialisation can help shed some light on what is going wrong. For example, there is a bug report, also on G98, of EVO timing out since updating to 4.15 (https://bugs.freedesktop.org/show_bug.cgi?id=105319); maybe you are experiencing that as well?
Comment 20 Ilia Mirkin 2018-05-09 12:18:34 UTC
Should fix the nouveau_mem_host issue:

https://github.com/skeggsb/nouveau/commit/bdc36dcf3fe469e6bb2a1366452dcb16b84e8bcf

