Bug 74492 - [NV4E] v3.13-rc8 nv4e + msi = random lockups
Status: RESOLVED FIXED
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
Version: git
Hardware: Other All
Importance: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
Reported: 2014-02-04 07:56 UTC by Ronald
Modified: 2015-10-22 07:25 UTC

Attachments
Full dmesg of a clean boot (49.03 KB, text/plain)
2014-02-04 07:56 UTC, Ronald
Screenshot of garbled OOPS (620.88 KB, image/jpeg)
2014-02-04 07:57 UTC, Ronald
Logs without msi (81.77 KB, text/plain)
2014-02-04 08:26 UTC, Ronald
Logs with msi (73.09 KB, text/plain)
2014-02-04 08:27 UTC, Ronald
Logs directly after corruption in tty (49.55 KB, text/plain)
2014-02-04 08:29 UTC, Ronald
Partial output from hang with MSI enabled (20.22 KB, text/plain)
2014-02-20 08:36 UTC, Ronald

Description Ronald 2014-02-04 07:56:19 UTC
Created attachment 93341
Full dmesg of a clean boot

Kind of looks like bug #73445.

For me, it happens:

- almost at random, at intervals ranging from hours to days (dammit...)
- not (yet) in OpenGL apps (I don't play games anymore)
- hangs in Firefox and Evince
- happens while scrolling, but also while the display is 'stationary'

The symptoms are:

- relatively small corruption (10%) that changes slightly over time
- disk writes only happen in periodic spurts with long pauses (SysRq + S did not help)

Eventually I'm dumped to a terminal with an oops, but I don't see anything nouveau-related in there. Attaching it anyway.

nouveau.config=NvMSI=0 'fixes' it. Or let me put it this way: It never occurred while using this parameter.
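
One way to confirm that the workaround is actually in effect is to look for the option on the running kernel's command line. The following is a minimal sketch, not something from the report: `nvmsi_from_cmdline` is a hypothetical helper name, it assumes the option is passed as `nouveau.config=` on the kernel command line (not through a modprobe configuration file), and it assumes multiple config options are comma-separated.

```shell
# Sketch: extract the NvMSI setting from a kernel command line read on stdin.
# nvmsi_from_cmdline is a hypothetical helper, not part of nouveau or any distro tooling.
nvmsi_from_cmdline() {
    # One parameter per line, keep only the value of nouveau.config=,
    # split its (assumed comma-separated) options, and pick the NvMSI entry.
    tr ' ' '\n' \
        | sed -n 's/^nouveau\.config=//p' \
        | tr ',' '\n' \
        | grep '^NvMSI='
}

# Usage on a live system:
#   nvmsi_from_cmdline < /proc/cmdline
```

If nothing is printed, the option is not on the command line and the driver is using its default.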
Comment 1 Ronald 2014-02-04 07:57:54 UTC
Created attachment 93342
Screenshot of garbled OOPS
Comment 2 Ronald 2014-02-04 07:59:09 UTC
I forgot to rotate the photo, but it's also unreadable. Sorry about that.

Oopses occur even after the switch to TTY, so this is probably not even the first one.
Comment 3 Ilia Mirkin 2014-02-04 08:14:24 UTC
Yes, probably the same thing as that bug, although unfortunately the bug filer from #73445 never responded to my suggestion.

Definitely sounds like something IRQ-ish is going bad. I think the ultimate resolution will be to just disable MSI on nv4e, but if you don't mind, could you attach the output of

# lspci -vvvnn
# cat /proc/interrupts

(Run as root -- well, really just the lspci bit needs root.)

for both the NvMSI=0 case as well as the default case. Even more ideally, do this when the corruption begins (for when MSI is enabled). Also, can you attach a dmesg of a boot without NvMSI=0? Do you see funny interrupt errors in dmesg (like "irq 16 nobody cared" sort of thing)?
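
The requests above could be scripted so that the same set of files is gathered for each boot (MSI enabled, `NvMSI=0`, and right after corruption starts). This is only a sketch: `collect_irq_info` and `has_irq_complaint` are hypothetical helper names, and the output file names are arbitrary.

```shell
# Sketch: gather lspci, /proc/interrupts and dmesg output into one directory.
# Run as root so that lspci can read the full capability list.
collect_irq_info() {
    out_dir="$1"
    mkdir -p "$out_dir" || return 1
    if command -v lspci >/dev/null 2>&1; then
        lspci -vvvnn > "$out_dir/lspci.txt" 2>&1 || echo "lspci failed" >&2
    fi
    if [ -r /proc/interrupts ]; then
        cat /proc/interrupts > "$out_dir/interrupts.txt"
    fi
    # dmesg may be restricted for non-root users; do not treat that as fatal.
    dmesg > "$out_dir/dmesg.txt" 2>/dev/null || true
}

# Read a kernel log on stdin; succeed if it contains an unhandled-IRQ complaint.
has_irq_complaint() {
    grep -q 'nobody cared'
}

# Usage:
#   collect_irq_info ./msi-default
#   has_irq_complaint < ./msi-default/dmesg.txt && echo "IRQ complaint found"
```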
Comment 4 Ronald 2014-02-04 08:26:41 UTC
Created attachment 93343
Logs without msi
Comment 5 Ronald 2014-02-04 08:27:00 UTC
Created attachment 93344
Logs with msi
Comment 6 Ronald 2014-02-04 08:29:38 UTC
Created attachment 93345
Logs directly after corruption in tty

I managed to reproduce it. I think it's reliable, but that will have to wait. I have to go in 30 minutes.

I was able to reproduce it by scrolling like a madman in tty2. It was just a guess =) .

However, nouveau seems to be the quiet type:

[  122.487034] nouveau E[     DRM] GPU lockup - switching to software fbcon

log_msi=1.txt contains the logs with MSI enabled, without corruption
corrupt.txt contains the logs with corruption
log_msi=0.txt is a clean boot
Comment 7 Ronald 2014-02-04 13:13:26 UTC
I don't see any messages about IRQs or the like.

The nice thing about the tty lockup is that the system is actually still responsive.

However, reproducing it is sometimes difficult. I also checked whether it triggers under NvMSI=0 (in case this is a separate issue), and I was not able to reproduce it.

Any pointers?
Comment 8 Ilia Mirkin 2014-02-05 19:35:10 UTC
Looks like the nv4x IGPs have some registers placed differently... I just cc'd you on a few patches; you can also get them at:

http://lists.freedesktop.org/archives/nouveau/2014-February/016032.html
http://lists.freedesktop.org/archives/nouveau/2014-February/016033.html
http://lists.freedesktop.org/archives/nouveau/2014-February/016034.html
Comment 9 Ronald 2014-02-05 19:35:59 UTC
I noticed, thanks!
Comment 10 Ilia Mirkin 2014-02-16 06:07:22 UTC
(In reply to comment #9)
> I noticed, thanks!

Were you able to test them out? Would be nice to confirm if they actually work as advertised. Although I do tend to trust mwk on such things :)
Comment 11 Ronald 2014-02-16 08:11:38 UTC
It's on my TODO list together with bug #70213. The laptop is being used right now, so this could take a while.
Comment 12 Ronald 2014-02-16 17:14:53 UTC
Patches applied, let's hope for the best.

Furthermore, hibernate seems to work with this laptop! Ever since kernel DRM was enabled for the nouveau driver, this had not been the case. Up until now... progress!
Comment 13 Ronald 2014-02-16 19:41:34 UTC
It went berserk shortly after the last post. I have the laptop hooked up to another machine with netconsole.

Last time it gave so many OOPSes, BUGs and WARNs that it should keep you all busy for the next week :) .

The bug is hard to reproduce, so this might take a while. It happens mostly while scrolling. It starts with parts of the screen not updating, keeping stale contents from before the scroll.
Comment 14 Ronald 2014-02-19 10:15:41 UTC
It does not seem to send the errors over netconsole. It crashed twice this morning using a git pull from 2 hours ago. No output :/ . I'll keep trying though.
Comment 15 Ilia Mirkin 2014-02-19 18:11:02 UTC
Is the system solid without MSI? Maybe we should just give up and disable MSI on it. NVIDIA never shipped drivers with MSI enabled for pre-nv50.
Comment 16 Ronald 2014-02-20 07:22:03 UTC
Yes, it never crashed with MSI disabled. I have the laptop hooked up anyway, so I will keep trying, hoping it will be able to send out what is going wrong.
Comment 17 Ronald 2014-02-20 08:36:43 UTC
Created attachment 94413
Partial output from hang with MSI enabled

I managed to capture some output during a hang. It's not much.

The music stops playing when the pfifo warnings show up. Then it starts playing and stops at the next pfifo warning.

Should I enable more verbose nouveau logs?
Comment 18 Ilia Mirkin 2015-10-22 07:25:42 UTC
(In reply to Ilia Mirkin from comment #15)
> Is the system solid without MSI? Maybe we should just give up and disable
> MSI on it. NVIDIA never shipped drivers with MSI enabled for pre-nv50.

MSI should be disabled for NV4C and NV4E in semi-recent kernels.
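
Whether a given kernel really leaves MSI off for the GPU can be checked from the capability line that `lspci -vv` prints for the device. A minimal sketch, assuming the output of a single device is piped in; `msi_state` is a hypothetical helper name, and `10de` in the usage line is the NVIDIA PCI vendor ID.

```shell
# Sketch: report the MSI state found in `lspci -vv` output read on stdin.
# Only the first MSI capability line is considered, so feed it one device.
msi_state() {
    # MSI-X lines read "MSI-X: Enable..." and are deliberately not matched.
    line=$(grep 'MSI: Enable[+-]' | head -n 1 || true)
    case "$line" in
        *'MSI: Enable+'*) echo "enabled" ;;
        *'MSI: Enable-'*) echo "disabled" ;;
        *) echo "no MSI capability listed" ;;
    esac
}

# Usage (as root, so the capability list is readable):
#   lspci -vv -d 10de: | msi_state
```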

