Bug 79518 - nouveau causes lockup and reboot on GT215
Status: NEW
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: 7.7 (2012.06)
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
 
Reported: 2014-06-02 00:26 UTC by Adam Borowski
Modified: 2016-04-06 21:46 UTC

Attachments
kernel log via a serial console (60.89 KB, text/plain)
2014-06-02 00:26 UTC, Adam Borowski
another log, with DRM=debug (156.00 KB, text/plain)
2014-06-02 01:36 UTC, Adam Borowski
syslog with kernel 4.6-rc2 (3.60 KB, text/plain)
2016-04-06 07:48 UTC, Adam Borowski
a non-tainted dump (serial console) (6.26 KB, text/plain)
2016-04-06 18:02 UTC, Adam Borowski
including boot with nouveau.debug=DRM=debug drm.debug=0xe (60.38 KB, text/plain)
2016-04-06 20:24 UTC, Adam Borowski
log with 07 pstate (27.06 KB, text/plain)
2016-04-06 21:37 UTC, Adam Borowski

Description Adam Borowski 2014-06-02 00:26:29 UTC
Created attachment 100258 [details]
kernel log via a serial console

With the NVIDIA GT215/240 in my system, nouveau crashes at random, roughly once an hour.  Typically there is a lockup, followed by a reboot a few seconds later.  With the proprietary driver, the system is stable.  The crash happens on both old and new kernels, up to 3.15-rc.
Comment 1 Adam Borowski 2014-06-02 01:36:50 UTC
Created attachment 100259 [details]
another log, with DRM=debug

TJK on IRC suggested booting with: nouveau.debug=DRM=debug drm.debug=0xe
Here's a log with these settings.  This time, there was no delay between the lockup and the reboot.  The log contains some stack traces, but they apparently come from the serial console being unable to cope with big bursts of debug output, so they are unrelated to the problem at hand.
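
For readers wondering what drm.debug=0xe selects: the value is a bitmask of DRM debug categories. The sketch below decodes it, assuming the bit assignments documented for the drm.debug module parameter in 4.x-era kernels (core, driver, kms, prime, atomic, vbl); other kernel versions may define the bits differently, and the helper name decode_drm_debug is made up for illustration.

```python
# Bit names are an assumption based on the drm.debug parameter description
# in 4.x-era Linux kernels; other kernel versions may define more bits.
DRM_DEBUG_BITS = {
    0x01: "core",
    0x02: "driver",
    0x04: "kms",
    0x08: "prime",
    0x10: "atomic",
    0x20: "vbl",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by a drm.debug mask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]


print(decode_drm_debug(0xE))  # ['driver', 'kms', 'prime']
```

Under that assumption, 0xe enables driver, KMS and PRIME messages while leaving the noisier core messages off.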
Comment 2 Adam Borowski 2016-04-06 07:48:10 UTC
Created attachment 122750 [details]
syslog with kernel 4.6-rc2

With kernel 4.6-rc2, there's some new output, including a stack trace.
Crashes seem to be less frequent: previously the system crashed reliably within an hour, whereas this is the first and so far only crash after ~2 days of running nouveau instead of the proprietary driver.
Comment 3 Adam Borowski 2016-04-06 18:02:37 UTC
Created attachment 122773 [details]
a non-tainted dump (serial console)

Found a way to trigger it: on 4.6-rc2, nothing during normal work or GL use causes the crash anymore, but something Chromium does triggers a crash almost instantly, every time.  So here's a non-tainted serial console dump.
Comment 4 Ilia Mirkin 2016-04-06 18:07:25 UTC
Was there anything before the tlb flush timeout error? Unfortunately I have no clue why those timeouts happen... basically it's an indication that the GPU is "done". You could disable GL stuff for chromium, either launching it with LIBGL_ALWAYS_SOFTWARE=1 or by disabling stuff in about:flags.

Perhaps the memory comes up in a (slightly) funny state? You could try reclocking it and see if that improves matters, have a look in /sys/kernel/debug/dri/0/pstate for the available perf levels, you can echo those values in to change states.
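
A minimal sketch of the software-rendering suggestion above, for anyone who wants to script it. It only builds an environment with LIBGL_ALWAYS_SOFTWARE=1 and hands it to a child process; the helper names and the "chromium" binary name in the comment are assumptions, since the executable is named differently on some distributions.

```python
import os
import subprocess


def software_gl_env(base_env=None):
    """Return a copy of the environment with Mesa software rendering forced."""
    env = dict(os.environ if base_env is None else base_env)
    env["LIBGL_ALWAYS_SOFTWARE"] = "1"
    return env


def launch_with_software_gl(command):
    """Start `command` (a list of arguments) with software GL; return the Popen object."""
    return subprocess.Popen(command, env=software_gl_env())


# Hypothetical usage; "chromium" is an assumed binary name:
#   launch_with_software_gl(["chromium"])
```

If the crash no longer occurs with software rendering, that would point at the GPU-accelerated GL path rather than at modesetting or plain 2D use.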
Comment 5 Adam Borowski 2016-04-06 20:24:59 UTC
Created attachment 122779 [details]
including boot with nouveau.debug=DRM=debug drm.debug=0xe

> Was there anything before the tlb flush timeout error?

Nothing since boot.  I don't know what messages are interesting to you, so here's a dump since boot, this time with nouveau.debug=DRM=debug drm.debug=0xe

> Unfortunately I have no clue why those timeouts happen... basically it's an indication that the GPU is "done".

I believe my hardware itself is ok, at least as in "never had any issue with the proprietary driver".  Those bastards dropped support for GT215 though...

> You could disable GL stuff for chromium, either launching it with LIBGL_ALWAYS_SOFTWARE=1 or by disabling stuff in about:flags.

I don't care about chromium, I got a better browser :p  But whatever GL calls it does can be invoked by some other program later...

> Perhaps the memory comes up in a (slightly) funny state? You could try reclocking it and see if that improves matters, have a look in /sys/kernel/debug/dri/0/pstate for the available perf levels, you can echo those values in to change states.

What values would you suggest?  I'm afraid I did not find documentation that's idiot-proof -- and I sadly got exactly 0 clue about this kind of stuff so I require some handholding to give you useful debug info.
Comment 6 Ilia Mirkin 2016-04-06 20:27:40 UTC
(In reply to Adam Borowski from comment #5)
> > Unfortunately I have no clue why those timeouts happen... basically it's an indication that the GPU is "done".
> 
> I believe my hardware itself is ok, at least as in "never had any issue with
> the proprietary driver".  Those bastards dropped support for GT215 though...

Yeah, I'm sure your HW is fine. But it just stops responding (reasonably) to nouveau when this happens, and we don't know what to do about it. Perhaps this is "normal" and the blob drivers know how to kick it in that case. Or perhaps we're doing something wrong to push the hw over the brink of sanity.

> > Perhaps the memory comes up in a (slightly) funny state? You could try reclocking it and see if that improves matters, have a look in /sys/kernel/debug/dri/0/pstate for the available perf levels, you can echo those values in to change states.
> 
> What values would you suggest?  I'm afraid I did not find documentation
> that's idiot-proof -- and I sadly got exactly 0 clue about this kind of
> stuff so I require some handholding to give you useful debug info.

cat /sys/kernel/debug/dri/0/pstate

I can provide more instructions when you give me the output of that.
Comment 7 Adam Borowski 2016-04-06 20:32:15 UTC
03: core 135 MHz shader 270 MHz memory 135 MHz
07: core 405 MHz shader 810 MHz memory 324 MHz
0f: core 600 MHz shader 1460 MHz memory 800 MHz
AC: core 405 MHz shader 810 MHz memory 324 MHz
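
The output above can be turned into structured data when comparing levels across machines or logs. This is only a sketch: it assumes every line has the "id: domain N MHz ..." shape shown in this comment, and parse_pstate is a made-up helper name, not part of nouveau or any library. Lines in other shapes are skipped rather than guessed at.

```python
import re

# Sample copied from the pstate output in this comment.
SAMPLE = """\
03: core 135 MHz shader 270 MHz memory 135 MHz
07: core 405 MHz shader 810 MHz memory 324 MHz
0f: core 600 MHz shader 1460 MHz memory 800 MHz
AC: core 405 MHz shader 810 MHz memory 324 MHz
"""

_LINE = re.compile(r"^\s*(?P<id>[0-9A-Za-z]+):\s*(?P<clocks>.*)$")
_CLOCK = re.compile(r"(?P<domain>\w+)\s+(?P<mhz>\d+)\s+MHz")


def parse_pstate(text):
    """Map each level id to a {clock domain: MHz} dict; unparsable lines are skipped."""
    levels = {}
    for line in text.splitlines():
        match = _LINE.match(line)
        if not match:
            continue
        clocks = {
            c.group("domain"): int(c.group("mhz"))
            for c in _CLOCK.finditer(match.group("clocks"))
        }
        if clocks:
            levels[match.group("id")] = clocks
    return levels


levels = parse_pstate(SAMPLE)
print(levels["AC"] == levels["07"])  # True
```

In this sample the AC line carries the same clocks as level 07.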
Comment 8 Ilia Mirkin 2016-04-06 20:37:00 UTC
(In reply to Adam Borowski from comment #7)
> 03: core 135 MHz shader 270 MHz memory 135 MHz
> 07: core 405 MHz shader 810 MHz memory 324 MHz
> 0f: core 600 MHz shader 1460 MHz memory 800 MHz
> AC: core 405 MHz shader 810 MHz memory 324 MHz

try

echo 07 > /sys/kernel/debug/dri/0/pstate

This should try to normalize the parameters while keeping the same perf level. You can also try 03 and 0f (for low and high pstates).
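
The same write can be scripted, for example to try each level in turn. A sketch under stated assumptions: it must run as root with debugfs mounted, the path is the one used in this thread (the DRI index may differ on other systems), and set_pstate is a made-up helper that simply writes the level id the way the echo command does.

```python
import re

# Path as used in this thread; the "0" DRI index is system-specific.
PSTATE_PATH = "/sys/kernel/debug/dri/0/pstate"


def set_pstate(level, path=PSTATE_PATH):
    """Write a two-digit hex level id (e.g. "07") to the pstate file."""
    if not re.fullmatch(r"[0-9a-fA-F]{2}", level):
        raise ValueError("expected a two-digit hex level id, got %r" % (level,))
    with open(path, "w") as f:
        f.write(level + "\n")


# Hypothetical usage (needs root): set_pstate("07")
```

The format check is only there to avoid writing a typo into debugfs; it does not verify that the level actually exists on the card.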
Comment 9 Adam Borowski 2016-04-06 21:37:22 UTC
Created attachment 122781 [details]
log with 07 pstate

With the 07 pstate, the lockup was different: chromium worked for a fairly long time (instead of causing a crash almost immediately).  When the lockup finally happened, the machine stayed up (minus the display), but this might just be the randomness inherent in such failures.

I'll try 03.
Comment 10 Adam Borowski 2016-04-06 21:46:58 UTC
Nah, it looks like it was random luck: with 03 pstate it crashed immediately, without even a single message on the serial console.

