Bug 67315

Summary:

[NVAA] System lockup with X consuming all CPU

Product:

xorg

Reporter:

Rick Stevens <ricks>

Component:

Driver/nouveau

Assignee:

Nouveau Project <nouveau>

Status:

RESOLVED MOVED

QA Contact:

Xorg Project Team <xorg-team>

Severity:

normal

Priority:

medium

Version:

unspecified

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Attachments:

Description	Flags
Xorg.0.log with modeset enabled and output of dmesg while problem is active	none
20-nouveau.conf, Xorg.0.log and dmesg output	none
BUG_ON(!data)	none
More file snapshots	none
Logs from 3.9.9-302.fc19.x86_64 kernel	none
mmio trace along with dmesg and Xorg logs	none
Add some writes to 0x100c14	none

Description Rick Stevens 2013-07-25 17:59:11 UTC

Created attachment 83002 [details]
Xorg.0.log with modeset enabled and output of dmesg while problem is active

When using nouveau with modeset, the system starts up fine but eventually (possibly when the screen saver kicks in), X consumes all CPU resources on one core rendering the machine useless except via ssh sessions. If I disable modeset at boot time (nouveau.modeset=0), then the system is stable but the drive does not accept the EDID from the monitor and I'm stuck with a 1024x768 display instead of the 1600x900 it has with modeset enabled.

I'm attaching the Xorg.0.log file along with an excerpt from the output of dmesg when this behavior is occurring.

Comment 1 Rick Stevens 2013-07-25 18:01:06 UTC

Forgot to mention, this is Fedora 19, fully updated on a dual-core AMD-based system. Video hardware is:

02:00.0 VGA compatible controller: NVIDIA Corporation C77 [GeForce 8200] (rev a2) (prog-if 00 [VGA controller])
	Subsystem: NVIDIA Corporation Device cb84
	Flags: bus master, fast devsel, latency 0, IRQ 5
	Memory at fb000000 (32-bit, non-prefetchable) [size=16M]
	Memory at d8000000 (64-bit, prefetchable) [size=128M]
	Memory at e6000000 (64-bit, prefetchable) [size=32M]
	I/O ports at bc00 [size=128]
	[virtual] Expansion ROM at e0000000 [disabled] [size=128K]
	Capabilities: [60] Power Management version 2
	Capabilities: [68] MSI: Enable- Count=1/1 Maskable- 64bit+

Comment 2 Emil Velikov 2013-07-26 00:01:56 UTC

Hi Rick

Before jumping to compulsions can you configure your Xserver to not load
* libglamoregl.so
* vesa_drv.so
* modesetting_drv.so

The last two should be ignored if you use an xorg.conf as shown in here [1]. Whereas for the first one you could try an temporary rename the file (there may be a better way to handle it, which I'm not aware of).

With that said, try and narrow down what's causing the issue. A scenario such as
* as soon as my screensaver kicks in
* using program XX
* doing action A, while using program YY, followed by action C.

As the issue occurs, try and attach the complete output of dmesg (or save the output to a file and attach at a later point)

Cheers
Emil

[1] "Configuring the X server" http://nouveau.freedesktop.org/wiki/InstallNouveau/

Comment 3 Rick Stevens 2013-07-26 18:09:10 UTC

On 07/25/2013 05:01 PM, bugzilla-daemon@freedesktop.org issued this missive:
> *Comment # 2 <https://bugs.freedesktop.org/show_bug.cgi?id=67315#c2> on
> bug 67315 <https://bugs.freedesktop.org/show_bug.cgi?id=67315> from Emil
> Velikov <mailto:emil.l.velikov@gmail.com> *
>
> Hi Rick
>
> Before jumping to compulsions can you configure your Xserver to not load
> * libglamoregl.so
> * vesa_drv.so
> * modesetting_drv.so
>
> The last two should be ignored if you use an xorg.conf as shown in here [1].
> Whereas for the first one you could try an temporary rename the file (there may
> be a better way to handle it, which I'm not aware of).

There is no xorg.conf file. This is a fresh, virgin install of Fedora
19, then updated.

> With that said, try and narrow down what's causing the issue. A scenario such
> as
> * as soon as my screensaver kicks in
> * using program XX
> * doing action A, while using program YY, followed by action C.
>
> As the issue occurs, try and attach the complete output of dmesg (or save the
> output to a file and attach at a later point)

I stated that it appears to occur when the screen saver starts and I did
include the dmesg data that started to spew out when the problem occurs.
That being said, I will restart everything and disable the screen saver
completely.

----------------------------------------------------------------------
- Rick Stevens, Systems Engineer, AllDigital    ricks@alldigital.com -
- AIM/Skype: therps2        ICQ: 22643734            Yahoo: origrps2 -
-                                                                    -
-    When you don't know what to do, walk fast and look worried.     -
----------------------------------------------------------------------

Comment 4 Emil Velikov 2013-07-26 18:30:10 UTC

(In reply to comment #3)
> 
> There is no xorg.conf file. This is a fresh, virgin install of Fedora
> 19, then updated.
> 

AFAICS I've asked you to try using an xorg.conf, rather than "Are you using one how?" :P

> 
> I stated that it appears to occur when the screen saver starts and I did
> include the dmesg data that started to spew out when the problem occurs.
> That being said, I will restart everything and disable the screen saver
> completely.
> 

You mentioned "possibly when the screen saver kicks in", which indicates a wild guess :P With that confirmed which screen saver you're talking about - bubble3d, etc.

There is a reason I've requested dmesg output, the one you attached is at rather unknown point with absolutely no information as to how nouveau handles your card :\. That's why I've mentioned a _complete_ dmesg. If you feel there is something confidential inside if clearly strip it out

Comment 5 Rick Stevens 2013-07-26 18:52:10 UTC

Created attachment 83046 [details]
20-nouveau.conf, Xorg.0.log and dmesg output

Comment 6 Emil Velikov 2013-07-26 20:34:59 UTC

Now this is interesting

[19.600630]
Initially the kernel "drm" seems to pass NULL pointer to the card, while doing nv50_fbcon_imageblit()

ch 1 [0x0007cb0000 DRM] subc 3 class 0x502d mthd 0x0860 data 0x00000000

The way I see it, after this point we're on the mercy of the hardware

[19.627513]
X/the ddx
MP is still funny and fails to execute/set EDGEFLAG_ENABLE to 1 (which for the sake of me I cannot find in the ddx code?) 

[55.041877]
At this point the GPU is completely stuffed fails to execute/set
NV50_2D_BLIT_SRC_Y_INT, (coming from X).

After that hell breaks loose :P


Rick a few interesting notes
In the last attachment Xorg.log claims that it's starts at ~81.456, whereas dmesg states that X was running ~19.627541.
* Are those logs matching (ie. captured from the same boot/system startup)
I'm assuming that you've started your screensaver ~55.041877. Is that correct ?
* Do you recall when did your nouveau started started reporting errors (nouveau E) ?

Note: the following messages are harmless
nouveau E[    PBUS][0000:02:00.0] MMIO read of * FAULT at 0x1002**

Cheers
Emil

Comment 7 Rick Stevens 2013-07-26 21:41:35 UTC

On 07/26/2013 01:34 PM, bugzilla-daemon@freedesktop.org issued this missive:
> *Comment # 6 <https://bugs.freedesktop.org/show_bug.cgi?id=67315#c6> on
> bug 67315 <https://bugs.freedesktop.org/show_bug.cgi?id=67315> from Emil
> Velikov <mailto:emil.l.velikov@gmail.com> *
>
> Now this is interesting
>
> [19.600630]
> Initially the kernel "drm" seems to pass NULL pointer to the card, while doing
> nv50_fbcon_imageblit()
>
> ch 1 [0x0007cb0000 DRM] subc 3 class 0x502d mthd 0x0860 data 0x00000000
>
> The way I see it, after this point we're on the mercy of the hardware
>
> [19.627513]
> X/the ddx
> MP is still funny and fails to execute/set EDGEFLAG_ENABLE to 1 (which for the
> sake of me I cannot find in the ddx code?)
>
> [55.041877]
> At this point the GPU is completely stuffed fails to execute/set
> NV50_2D_BLIT_SRC_Y_INT, (coming from X).
>
> After that hell breaks loose :P

Emil, thanks for looking at this. It's, uhm, interesting, isn't it?

For clarity sake, this is a bit of an odd system. It's a Shuttle 
motherboard. From "dmidecode":

Handle 0x0002, DMI type 2, 8 bytes
Base Board Information
	Manufacturer: Shuttle Inc
	Product Name: FN78S
	Version: V10
	Serial Number:

> Rick a few interesting notes
> In the last attachment Xorg.log claims that it's starts at ~81.456, whereas
> dmesg states that X was running ~19.627541.
> * Are those logs matching (ie. captured from the same boot/system startup)
> I'm assuming that you've started your screensaver ~55.041877. Is that correct ?
> * Do you recall when did your nouveau started started reporting errors (nouveau
> E) ?

I booted the machine and as soon as the display locked up, I ssh'd to it
from another machine and simply did a "dmesg >/rick/dmesg.txt" to
capture as much as I could. The Xorg.0.log was already generated by the
time I logged in via ssh. I combined the various logs into one file and
shot it off to bugzilla.

As far as the screen saver, I'm running XFCE and the lockup occurred as
soon as I went through the "Applications Menu->Settings->Screensaver"
menu tree. It tried to render the Screensaver window, got as far as
drawing the box around it and everything locked up. After about a
minute, the graphical screen blanked and I started getting the GPU
lockup and going back to fbcon messages appeared on the console. The
GUI screen then reappeared but it was still locked up. Again it
cleared, the GPU lockup message appeared again and around and around we
go. It might do the same thing rendering other windows, but I'm trying
to be consistent to help debug this.

I'll be more than happy to keep tinkering, but I've had like 3 hours
of sleep in the last two days and I'm a bit knackered. I can pick this
back up Monday. I'm in California if that's of any help.

> Note: the following messages are harmless
> nouveau E[    PBUS][0000:02:00.0] MMIO read of * FAULT at 0x1002**

Glad to hear that! That's the first thing that appears after the
plymouth "fill the bubble" screen and before I get the GDM login.
----------------------------------------------------------------------
- Rick Stevens, Systems Engineer, AllDigital    ricks@alldigital.com -
- AIM/Skype: therps2        ICQ: 22643734            Yahoo: origrps2 -
-                                                                    -
-     Squawk!  Pieces of Seven!  Pieces of Seven!  Parity Error!     -
----------------------------------------------------------------------

Comment 8 Rick Stevens 2013-07-29 16:31:04 UTC

I'm ready to continue sorting this out. Enjoyed adequate sleep over the weekend so I'm rarin' to go.

What else do you need from me?

Comment 9 Emil Velikov 2013-07-29 16:48:45 UTC

(In reply to comment #8)
> I'm ready to continue sorting this out. Enjoyed adequate sleep over the
> weekend so I'm rarin' to go.
> 
> What else do you need from me?

Answers to the last two questions would be nice as a start :P

(In reply to comment #6)
> In the last attachment Xorg.log claims that it's starts at ~81.456, whereas
> dmesg states that X was running ~19.627541.
> * Are those logs matching (ie. captured from the same boot/system startup)
> I'm assuming that you've started your screensaver ~55.041877. Is that
> correct ?
I.e. understanding the timeline would be beneficial

> * Do you recall when did your nouveau started started reporting errors
> (nouveau E) ?
> 
I'm talking about nouveau errors in general, indicated by "nouveau E" disregarding the I've mentioned as harmless. If unsure attach dmesg of older kernels that you have handy :)

Comment 10 Emil Velikov 2013-07-29 17:08:48 UTC

Created attachment 83220 [details] [review]
BUG_ON(!data)

Also you can try this simple patch, it would trigger a kernel bug so be warned.
If/when it does capture the output and attach it in here (either in picture or text form)
Thanks

Comment 11 Rick Stevens 2013-07-29 20:30:49 UTC

(In reply to comment #9)
> (In reply to comment #8)
> > I'm ready to continue sorting this out. Enjoyed adequate sleep over the
> > weekend so I'm rarin' to go.
> > 
> > What else do you need from me?
> 
> Answers to the last two questions would be nice as a start :P

In answer to your first question (I think), yes, those captures were from the same boot/crash cycle, acquired as I mentioned in my comments.

In answer to the second, I'm not sure how to correlate the timestamps. Looking at the dmesg output, the first "nouveau E" message occurred 19 seconds into the boot process. This is long before the login screen for the desktop appears.

The next messages seem to occur once I log in via the desktop login screen, about 55 seconds after the boot. The screensaver shouldn't start for at least 10 minutes after that. As I said, if I try to do something such as adjust the screen saver parameters via XFCE's "Applications Menu->Settings->Screensaver" menu, as soon as the frame containing the the applications window is rendered the display, keyboard and mouse lock up.

> 
> (In reply to comment #6)
> > In the last attachment Xorg.log claims that it's starts at ~81.456, whereas
> > dmesg states that X was running ~19.627541.
> > * Are those logs matching (ie. captured from the same boot/system startup)
> > I'm assuming that you've started your screensaver ~55.041877. Is that
> > correct ?
> I.e. understanding the timeline would be beneficial
> 
> > * Do you recall when did your nouveau started started reporting errors
> > (nouveau E) ?
> > 
> I'm talking about nouveau errors in general, indicated by "nouveau E"
> disregarding the I've mentioned as harmless. If unsure attach dmesg of older
> kernels that you have handy :)

dmesg is wiped each time I boot the machine. I guess I could try to set up something that would log it to a disk file instead of just in the ring buffer.

Comment 12 Rick Stevens 2013-07-29 20:43:41 UTC

(In reply to comment #10)
> Created attachment 83220 [details] [review] [review]
> BUG_ON(!data)
> 
> Also you can try this simple patch, it would trigger a kernel bug so be
> warned.
> If/when it does capture the output and attach it in here (either in picture
> or text form)
> Thanks

I'll need to free up some time this week to build a new kernel. I'll put it on the "got to do" list.

Comment 13 Emil Velikov 2013-07-29 21:28:57 UTC

(In reply to comment #11)
> In answer to your first question (I think), yes, those captures were from
> the same boot/crash cycle, acquired as I mentioned in my comments.
> 
Obviously I did not express myself clear enough :)

The Xorg.0.log attached is _not_ from when the crash/lockup occurred. It is from the second X session (the one that was started by your system after X died the first time). I was seeking confirmation of the above with my first question :P

Does your system has plymouth or similar splash manager ? Please disable it, try to reproduce the issue and attach dmesg _before_ trying the patch/recompiling the kernel

[...]
> adjust the screen saver parameters via XFCE's "Applications
> Menu->Settings->Screensaver" menu, as soon as the frame containing the the
> applications window is rendered the display, keyboard and mouse lock up.
> 
Please do not assume that I'm running XFCE and/or know what exactly it does behind the scenes. The name and/or package of the screensaver would be great :)

[...]
> dmesg is wiped each time I boot the machine. I guess I could try to set up
> something that would log it to a disk file instead of just in the ring
> buffer.
No need for such an overkill

All I was asking - "Is this a kernel regression ?", or in other words
"before you installed Fedora 19 with kernel 3.9.9, you had XXX, running kernel XXX. Do you recall any messages similar to 'nouveau E' in dmesg?"

Quick $grep "nouveau E" -r /var/logs/ may help depending on your setup

Cheers
Emil

Comment 14 Rick Stevens 2013-07-30 00:16:25 UTC

Created attachment 83247 [details]
More file snapshots

Comment 15 Rick Stevens 2013-07-30 00:20:55 UTC

Sorry, forgot to comment about the attachment I just put up.

The machine did use plymouth, so I removed the "rhgb" stuff from the boot command and repeated everything as before, but included both Xorg.0.log[.old] files and another capture of dmesg.

The screensaver is "xscreensaver" and the menu tree fires up "xscreensaver-demo".
They come from the xscreensaver-5.22.1.fc19.x86_64 set of RPMs. The system is up-to-date running kernel 3.10.3-300.fc19.x86_64.

Comment 16 Rick Stevens 2013-07-30 00:39:43 UTC

Oh, one additional thing...nouveau never seemed to work with this hardware. Before I upgraded this machine to F19, it had been running F18 but I ended up running the nvidia blob driver system since I never could get nouveau to work. I could go back to that if needed, but I thought that since this is currently an experimental system, I might try to get nouveau to work properly and possibly help the community.

I do appreciate your help on this Emil. I just wanted you to know that and if I seem a bit, well, dense it's because I'm not that familiar with how the entire X mechanism works on this sort of hardware.

Comment 17 Emil Velikov 2013-07-30 18:44:51 UTC

(In reply to comment #16)
> Oh, one additional thing...nouveau never seemed to work with this hardware.
Ouch, now that is interesting :)

> I do appreciate your help on this Emil. I just wanted you to know that and
> if I seem a bit, well, dense it's because I'm not that familiar with how the
> entire X mechanism works on this sort of hardware.
We all have our strengths and weaknesses :)

Looking at your latest dmesg, the pointer have changed from 0 to another (probably invalid) value - 0xfe00ccc6

nouveau E[  PGRAPH][0000:02:00.0] ch 1 [0x0007cb0000 DRM] subc 3 class 0x502d mthd 0x0860 data 0xfe00ccc6

Would be great if you can do the following

* Notice if the the value varies between reboots, or does it differ on kernel version

example
       kernel version
       3.9.9   3.10.5
boot
1      0xbla0  0xbla1
2      0xbla2  0xbla3
3      0xbla4  0xbla5

Comment 18 Rick Stevens 2013-07-30 20:12:42 UTC

Attached is the same data from kernel 3.9.9-302.fc19.x86_64.

Comment 19 Rick Stevens 2013-07-30 20:13:29 UTC

Created attachment 83330 [details]
Logs from 3.9.9-302.fc19.x86_64 kernel

Comment 20 Emil Velikov 2013-07-31 15:25:03 UTC

Rick there is no need to try the BUG_ON(!data).patch

I was assuming that the data printed by nouveau indicates the CPU pointer(rather than one from the gpu's vm), silly me.

That said, the only thing that I can think of is that either the VM is not setup properly or data is getting overwritten somewhere :\

Either case a mmiotrace of the blob may be great to have [1], although not too sure when I'll have some time to look into it

Cheers
Emil

[1] http://nouveau.freedesktop.org/wiki/MmioTrace/
https://wiki.ubuntu.com/X/MMIOTracing - FOR ACCELERATION BUGS

Comment 21 Rick Stevens 2013-07-31 16:58:57 UTC

Created attachment 83383 [details]
mmio trace along with dmesg and Xorg logs

I did as you asked...rebooted in run level 3 with plymouth disabled. Started the mmio trace, then started X by using "telinit 5", tried to bring up the screen saver stuff, waited for it to lock up, waited until it restarted, then stopped the mmio trace and grabbed dmesg and the last two Xorg log files.

I hope that gives you more insight. I await your feedback.

Comment 22 Ilia Mirkin 2013-08-30 21:06:11 UTC

Did you mmiotrace nouveau or the nvidia proprietary driver? The request was for the latter. [Although I don't really see how that will help... but I also haven't looked at all the details of this bug.]

It would be worth testing 3.11-rc7 on this (or 3.11 if it's out by the time you get to it).

I wonder if there's a bug somewhere handling the NVAA as NVA3+ (which it's not). But then NVAC would also be affected... or something in the ctxprogs... (I guess for which the mmiotrace could come in handy.)

Comment 23 Pierre Moreau 2014-09-25 12:01:06 UTC

Created attachment 106849 [details] [review]
Add some writes to 0x100c14

Does this patch help (applying it to a recent kernel code would be better)?

Comment 24 Pierre Moreau 2015-01-18 20:13:19 UTC

Is this still an issue using kernel 3.19-rc4?

Comment 25 poma 2015-01-19 12:18:14 UTC

(In reply to Pierre Moreau from comment #24)
> Is this still an issue using kernel 3.19-rc4?

Questionable kernel version is not only,
Fedora 19 End of Life
https://lists.fedoraproject.org/pipermail/announce/2015-January/003248.html

Comment 26 Pierre Moreau 2015-01-19 13:18:37 UTC

(In reply to poma from comment #25)
> (In reply to Pierre Moreau from comment #24)
> > Is this still an issue using kernel 3.19-rc4?
> 
> Questionable kernel version is not only,
> Fedora 19 End of Life
> https://lists.fedoraproject.org/pipermail/announce/2015-January/003248.html

The fact that he had F19 when he opened the bug report doesn't imply that he didn't upgrade to F20 or F21 since his last post.
Besides, one can compile a kernel, or test using one of the live images here: https://nouveau.pmoreau.org.

Comment 27 poma 2015-01-20 11:04:22 UTC

To help you, I left him a note on users@lists.fedoraproject.org to respond here.

Comment 28 Martin Peres 2019-12-04 08:34:49 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/48.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.