79518 – nouveau causes lockup and reboot on GT215

Status:	RESOLVED MOVED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/nouveau
Version:	7.7 (2012.06)
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Nouveau Project
QA Contact:	Xorg Project Team

Reported:	2014-06-02 00:26 UTC by Adam Borowski
Modified:	2019-12-04 08:45 UTC
CC List:	0 users

Attachments
kernel log via a serial console (60.89 KB, text/plain) 2014-06-02 00:26 UTC, Adam Borowski
another log, with DRM=debug (156.00 KB, text/plain) 2014-06-02 01:36 UTC, Adam Borowski
syslog with kernel 4.6-rc2 (3.60 KB, text/plain) 2016-04-06 07:48 UTC, Adam Borowski
a non-tainted dump (serial console) (6.26 KB, text/plain) 2016-04-06 18:02 UTC, Adam Borowski
including boot with nouveau.debug=DRM=debug drm.debug=0xe (60.38 KB, text/plain) 2016-04-06 20:24 UTC, Adam Borowski
log with 07 pstate (27.06 KB, text/plain) 2016-04-06 21:37 UTC, Adam Borowski

Description Adam Borowski 2014-06-02 00:26:29 UTC

Created attachment 100258
kernel log via a serial console

On my system, with an NVIDIA GT215/240, nouveau causes random crashes roughly once an hour.  Typically, there's a lockup followed by a reboot a few seconds later.  With the proprietary driver, the system is stable.  The crash happens on both old and new kernels, up to 3.15-rc.

Comment 1 Adam Borowski 2014-06-02 01:36:50 UTC

Created attachment 100259
another log, with DRM=debug

TJK on IRC suggested booting with: nouveau.debug=DRM=debug drm.debug=0xe
Here's a log with these settings.  This time, there was no delay between the lockup and reboot.  The log contains some stack traces, but they apparently come from the serial console being unable to cope with big bursts of debug info, and thus are unrelated to the problem at hand.
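
Before collecting a log like this, it can help to confirm that both parameters actually reached the kernel by looking for them in /proc/cmdline. A minimal sketch; the helper name has_boot_param is made up for illustration and is not part of any tool mentioned in this report:

```shell
# Hypothetical helper: succeed if the given parameter appears as a whole
# word on the kernel command line.
#   $1: parameter to look for
#   $2: command line text (optional; read from /proc/cmdline when omitted)
has_boot_param() {
    cmdline="${2-$(cat /proc/cmdline)}"
    case " $cmdline " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   has_boot_param nouveau.debug=DRM=debug && has_boot_param drm.debug=0xe \
#       && echo "debug parameters are active"
```

The optional second argument exists only so the check can be tried against a pasted command line from another machine.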

Comment 2 Adam Borowski 2016-04-06 07:48:10 UTC

Created attachment 122750
syslog with kernel 4.6-rc2

With kernel 4.6-rc2, there's some new output, including a stack trace.
The frequency of crashes seems to be lower: before, it crashed reliably within an hour; this is the first and so far only crash after ~2 days of running nouveau instead of the proprietary driver.

Comment 3 Adam Borowski 2016-04-06 18:02:37 UTC

Created attachment 122773
a non-tainted dump (serial console)

Found a way to trigger it: while on 4.6-rc2 nothing during normal work or GL stuff causes the crash anymore, something Chromium does always causes a crash nearly instantly.  So here's a non-tainted serial console dump.

Comment 4 Ilia Mirkin 2016-04-06 18:07:25 UTC

Was there anything before the tlb flush timeout error? Unfortunately I have no clue why those timeouts happen... basically it's an indication that the GPU is "done". You could disable GL stuff for chromium, either launching it with LIBGL_ALWAYS_SOFTWARE=1 or by disabling stuff in about:flags.
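
The environment-variable route can be sketched as a small wrapper that sets LIBGL_ALWAYS_SOFTWARE=1 for a single command only. The wrapper name run_with_software_gl is invented for illustration, and the browser binary name (chromium here) is an assumption that varies between distributions:

```shell
# Hypothetical wrapper: run one command with Mesa's software renderer
# forced, without changing the environment of the calling shell.
run_with_software_gl() {
    LIBGL_ALWAYS_SOFTWARE=1 "$@"
}

# Example (binary name is an assumption; it may be chromium-browser):
#   run_with_software_gl chromium
```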

Perhaps the memory comes up in a (slightly) funny state? You could try reclocking it and see if that improves matters: have a look in /sys/kernel/debug/dri/0/pstate for the available perf levels; you can echo one of those values into that file to change states.

Comment 5 Adam Borowski 2016-04-06 20:24:59 UTC

Created attachment 122779
including boot with nouveau.debug=DRM=debug drm.debug=0xe

> Was there anything before the tlb flush timeout error?

Nothing since boot.  I don't know what messages are interesting to you, so here's a dump since boot, this time with nouveau.debug=DRM=debug drm.debug=0xe

> Unfortunately I have no clue why those timeouts happen... basically it's an indication that the GPU is "done".

I believe my hardware itself is ok, at least as in "never had any issue with the proprietary driver".  Those bastards dropped support for GT215 though...

> You could disable GL stuff for chromium, either launching it with LIBGL_ALWAYS_SOFTWARE=1 or by disabling stuff in about:flags.

I don't care about chromium, I got a better browser :p  But whatever GL calls it does can be invoked by some other program later...

> Perhaps the memory comes up in a (slightly) funny state? You could try reclocking it and see if that improves matters, have a look in /sys/kernel/debug/dri/0/pstate for the available perf levels, you can echo those values in to change states.

What values would you suggest?  I'm afraid I did not find documentation that's idiot-proof -- and I sadly got exactly 0 clue about this kind of stuff so I require some handholding to give you useful debug info.

Comment 6 Ilia Mirkin 2016-04-06 20:27:40 UTC

(In reply to Adam Borowski from comment #5)
> > Unfortunately I have no clue why those timeouts happen... basically it's an indication that the GPU is "done".
> 
> I believe my hardware itself is ok, at least as in "never had any issue with
> the proprietary driver".  Those bastards dropped support for GT215 though...

Yeah, I'm sure your HW is fine. But it just stops responding (reasonably) to nouveau when this happens, and we don't know what to do about it. Perhaps this is "normal" and the blob drivers know how to kick it in that case. Or perhaps we're doing something wrong to push the hw over the brink of sanity.

> > Perhaps the memory comes up in a (slightly) funny state? You could try reclocking it and see if that improves matters, have a look in /sys/kernel/debug/dri/0/pstate for the available perf levels, you can echo those values in to change states.
> 
> What values would you suggest?  I'm afraid I did not find documentation
> that's idiot-proof -- and I sadly got exactly 0 clue about this kind of
> stuff so I require some handholding to give you useful debug info.

cat /sys/kernel/debug/dri/0/pstate

I can provide more instructions when you give me the output of that.

Comment 7 Adam Borowski 2016-04-06 20:32:15 UTC

03: core 135 MHz shader 270 MHz memory 135 MHz
07: core 405 MHz shader 810 MHz memory 324 MHz
0f: core 600 MHz shader 1460 MHz memory 800 MHz
AC: core 405 MHz shader 810 MHz memory 324 MHz
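
In this listing the AC line appears to report the clocks currently in effect, and they coincide with level 07. A small sketch that makes the same comparison mechanically, assuming the file keeps the "id: clocks" layout shown above; the function name current_pstate is invented for illustration:

```shell
# Hypothetical helper: read a pstate listing on stdin and print the id of
# every perf level whose clocks equal those on the "AC:" line.
current_pstate() {
    awk '
        NF == 0 { next }                      # skip blank lines
        {
            id = $1
            sub(/:$/, "", id)                 # "07:" -> "07"
            clocks = $0
            sub(/^[^:]*: */, "", clocks)      # drop the leading "id: "
            sub(/[ \t]+$/, "", clocks)        # drop trailing whitespace
        }
        id == "AC" { current = clocks; next }
        { level[id] = clocks; order[++n] = id }
        END {
            if (current == "") exit
            for (i = 1; i <= n; i++)
                if (level[order[i]] == current)
                    print order[i]
        }
    '
}

# Example (as root, with debugfs mounted):
#   current_pstate < /sys/kernel/debug/dri/0/pstate
```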

Comment 8 Ilia Mirkin 2016-04-06 20:37:00 UTC

(In reply to Adam Borowski from comment #7)
> 03: core 135 MHz shader 270 MHz memory 135 MHz
> 07: core 405 MHz shader 810 MHz memory 324 MHz
> 0f: core 600 MHz shader 1460 MHz memory 800 MHz
> AC: core 405 MHz shader 810 MHz memory 324 MHz

try

echo 07 > /sys/kernel/debug/dri/0/pstate

This should try to normalize the parameters while keeping the same perf level. You can also try 03 and 0f (for low and high pstates).
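
The same write can be wrapped so that a typo does not end up in the debugfs file. This is only a sketch: the function name set_pstate and the PSTATE_FILE override are made up for illustration, and the two-hex-digit check is an assumption based on the ids listed in comment 7:

```shell
# Path used in this report; overridable so the function can be tried
# against an ordinary file first.
PSTATE_FILE="${PSTATE_FILE:-/sys/kernel/debug/dri/0/pstate}"

# Hypothetical helper: write a perf level id (e.g. 03, 07 or 0f) to the
# pstate file, refusing anything that is not two hex digits.
set_pstate() {
    case "$1" in
        [0-9a-fA-F][0-9a-fA-F]) ;;
        *) echo "not a two-digit hex pstate id: $1" >&2; return 1 ;;
    esac
    printf '%s\n' "$1" > "$PSTATE_FILE"
}

# Example (as root, since debugfs is normally root-only):
#   set_pstate 07
```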

Comment 9 Adam Borowski 2016-04-06 21:37:22 UTC

Created attachment 122781
log with 07 pstate

With the 07 pstate, the lockup was different: chromium worked for a longish time (instead of causing a crash almost immediately).  When finally the lockup happened, the machine remained operative (sans display) -- but this might be randomness inherent in such failures.

I'll try 03.

Comment 10 Adam Borowski 2016-04-06 21:46:58 UTC

Nah, it looks like it was random luck: with 03 pstate it crashed immediately, without even a single message on the serial console.

Comment 11 Martin Peres 2019-12-04 08:45:45 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/110.