Created attachment 100258: kernel log via a serial console

On my system, with an NVIDIA GT215/240, nouveau causes random crashes roughly once an hour. Typically, there's a lockup followed by a reboot a few seconds later. With the proprietary driver, the system is stable. The crash happens on both old and new kernels, up to 3.15-rc.
Created attachment 100259: another log, with DRM=debug

TJK on IRC suggested booting with:

nouveau.debug=DRM=debug drm.debug=0xe

Here's a log with these settings. This time, there was no delay between the lockup and the reboot. The log contains some stack traces, but they apparently come from the serial console being unable to cope with big bursts of debug info, and are thus unrelated to the problem at hand.
Created attachment 122750: syslog with kernel 4.6-rc2

With kernel 4.6-rc2, there's some new output, including a stack trace. Crashes also seem to be less frequent. Before, it crashed reliably within an hour; this is the first and so far only crash after ~2 days of running nouveau instead of the proprietary driver.
Created attachment 122773: a non-tainted dump (serial console)

Found a way to trigger it: on 4.6-rc2, nothing during normal work or GL stuff causes the crash anymore, but something Chromium does causes a crash nearly instantly, every time. So here's a non-tainted serial console dump.
Was there anything before the tlb flush timeout error? Unfortunately I have no clue why those timeouts happen... basically it's an indication that the GPU is "done".

You could disable GL stuff for chromium, either launching it with LIBGL_ALWAYS_SOFTWARE=1 or by disabling stuff in about:flags.

Perhaps the memory comes up in a (slightly) funny state? You could try reclocking it and see if that improves matters, have a look in /sys/kernel/debug/dri/0/pstate for the available perf levels, you can echo those values in to change states.
Created attachment 122779: including boot with nouveau.debug=DRM=debug drm.debug=0xe

> Was there anything before the tlb flush timeout error?

Nothing since boot. I don't know what messages are interesting to you, so here's a dump since boot, this time with nouveau.debug=DRM=debug drm.debug=0xe

> Unfortunately I have no clue why those timeouts happen... basically it's an indication that the GPU is "done".

I believe my hardware itself is ok, at least as in "never had any issue with the proprietary driver". Those bastards dropped support for GT215 though...

> You could disable GL stuff for chromium, either launching it with LIBGL_ALWAYS_SOFTWARE=1 or by disabling stuff in about:flags.

I don't care about chromium, I got a better browser :p But whatever GL calls it does can be invoked by some other program later...

> Perhaps the memory comes up in a (slightly) funny state? You could try reclocking it and see if that improves matters, have a look in /sys/kernel/debug/dri/0/pstate for the available perf levels, you can echo those values in to change states.

What values would you suggest? I'm afraid I did not find documentation that's idiot-proof -- and I sadly got exactly 0 clue about this kind of stuff so I require some handholding to give you useful debug info.
(In reply to Adam Borowski from comment #5)
> > Unfortunately I have no clue why those timeouts happen... basically it's an indication that the GPU is "done".
>
> I believe my hardware itself is ok, at least as in "never had any issue with
> the proprietary driver". Those bastards dropped support for GT215 though...

Yeah, I'm sure your HW is fine. But it just stops responding (reasonably) to nouveau when this happens, and we don't know what to do about it. Perhaps this is "normal" and the blob drivers know how to kick it in that case. Or perhaps we're doing something wrong to push the hw over the brink of sanity.

> > Perhaps the memory comes up in a (slightly) funny state? You could try reclocking it and see if that improves matters, have a look in /sys/kernel/debug/dri/0/pstate for the available perf levels, you can echo those values in to change states.
>
> What values would you suggest? I'm afraid I did not find documentation
> that's idiot-proof -- and I sadly got exactly 0 clue about this kind of
> stuff so I require some handholding to give you useful debug info.

cat /sys/kernel/debug/dri/0/pstate

I can provide more instructions when you give me the output of that.
03: core 135 MHz shader 270 MHz memory 135 MHz
07: core 405 MHz shader 810 MHz memory 324 MHz
0f: core 600 MHz shader 1460 MHz memory 800 MHz
AC: core 405 MHz shader 810 MHz memory 324 MHz
(In reply to Adam Borowski from comment #7)
> 03: core 135 MHz shader 270 MHz memory 135 MHz
> 07: core 405 MHz shader 810 MHz memory 324 MHz
> 0f: core 600 MHz shader 1460 MHz memory 800 MHz
> AC: core 405 MHz shader 810 MHz memory 324 MHz

try

echo 07 > /sys/kernel/debug/dri/0/pstate

This should try to normalize the parameters while keeping the same perf level. You can also try 03 and 0f (for low and high pstates).
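For anyone repeating the reclocking test above, here is a minimal sketch of the same steps as shell helpers. The function names `list_pstates` and `set_pstate` and the `PSTATE_FILE` variable are made up for illustration; only the debugfs path and the `echo` into it come from this thread. It assumes the selectable levels are the lines starting with a two-digit lowercase hex ID, and that the `AC:` line reports the current clocks rather than a selectable level.

```shell
#!/bin/sh
# Path of nouveau's pstate file as quoted in this thread.
# PSTATE_FILE is a hypothetical override, handy for trying this on a copy.
PSTATE_FILE="${PSTATE_FILE:-/sys/kernel/debug/dri/0/pstate}"

# Print the IDs of the listed perf levels, one per line (e.g. 03, 07, 0f).
# Assumption: lines such as "AC:" are status lines and are skipped.
list_pstates() {
    sed -n 's/^\([0-9a-f][0-9a-f]\):.*/\1/p' "$PSTATE_FILE"
}

# Write a perf level to the pstate file, but only if it is one of the
# levels the file itself lists; otherwise complain and return non-zero.
set_pstate() {
    level="$1"
    if list_pstates | grep -qx "$level"; then
        echo "$level" > "$PSTATE_FILE"
    else
        echo "unknown pstate: $level" >&2
        return 1
    fi
}
```

After sourcing this as root (debugfs has to be mounted), `list_pstates` shows what is available and `set_pstate 07` does the same as the `echo` above, refusing values the file does not list.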
Created attachment 122781: log with 07 pstate

With the 07 pstate, the lockup was different: chromium worked for a longish time (instead of causing a crash almost immediately). When finally the lockup happened, the machine remained operative (sans display) -- but this might be randomness inherent in such failures. I'll try 03.
Nah, it looks like it was random luck: with 03 pstate it crashed immediately, without even a single message on the serial console.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/110.