23593 – nouveau hangs after a few minutes

Status:	RESOLVED FIXED

Alias:	None

Product:	xorg
Classification:	Unclassified
Component:	Driver/nouveau
Version:	unspecified
Hardware:	x86 (IA32) Linux (All)

Importance:	medium normal
Assignee:	Nouveau Project
QA Contact:	Xorg Project Team

URL:
Whiteboard:
Keywords:

Duplicates (1):	23086
Depends on:
Blocks:

Reported:	2009-08-30 05:47 UTC by aeriksson
Modified:	2009-09-23 11:25 UTC
CC List:	2 users

Attachments
log from boot to hang (powered off) (19.78 KB, application/x-gzip) 2009-08-30 07:17 UTC, aeriksson
new log. boot to hang part1 (1000.00 KB, application/octet-stream) 2009-09-02 14:13 UTC, aeriksson
part2 (1000.00 KB, application/octet-stream) 2009-09-02 14:14 UTC, aeriksson
part3 (1000.00 KB, application/octet-stream) 2009-09-02 14:15 UTC, aeriksson
part4/4 (176.55 KB, application/octet-stream) 2009-09-02 14:15 UTC, aeriksson
Xorg.0.log (17.87 KB, application/octet-stream) 2009-09-03 13:34 UTC, aeriksson
40 min boot to crash log (308.66 KB, application/gzip) 2009-09-05 05:52 UTC, aeriksson
hang log from master (78.38 KB, application/x-gzip) 2009-09-10 05:10 UTC, aeriksson
fence_emit_race.patch (783 bytes, patch) 2009-09-14 05:01 UTC, Francisco Jerez

Description aeriksson 2009-08-30 05:47:23 UTC

With recent git, the screen freezes after launching X.

When just running a few xterms, it works ok for a while then it freezes. The mouse continues to move ok, and you can still ssh to the box etc.

It _seems_ the freeze can be triggered by starting firefox.

This is on kernel 2.6.31-rc6

Comment 1 Maarten Maathuis 2009-08-30 06:27:09 UTC

Logs are missing.

Comment 2 aeriksson 2009-08-30 07:17:09 UTC

Created attachment 29025
log from boot to hang (powered off)

And here is the log

Comment 3 Pekka Paalanen 2009-08-30 11:46:32 UTC

From your log:

Aug 30 12:10:02 tippex nvidiafb: EDID found from BUS1
Aug 30 12:10:02 tippex nvidiafb: Using CRT on CRTC 0
Aug 30 12:10:02 tippex nvidiafb: MTRR set to ON
Aug 30 12:10:02 tippex nvidiafb: PCI nVidia NV2 framebuffer (16MB @ 0xE2000000)

nvidiafb is bound to give problems. You must get rid of it.
Reference: http://nouveau.freedesktop.org/wiki/FAQ  Troubleshooting
See also the KMS troubleshooting related items if you try KMS.

Comment 4 aeriksson 2009-09-01 11:21:29 UTC

Douh. Sorry about that. Removing the nvidiafb modules does make it better (and I get kms!!).

However, leaving the machine for the night locks the screensaver. Wiggling the mouse/kbd doesn't light up the screen. The machine continues to route packets though.

I get a bunch of these in the log:

Sep  1 19:11:13 tippex [drm:drm_ioctl], pid=2376, cmd=0x40086485, nr=0x85, dev 0xe200, auth=1
Sep  1 19:11:13 tippex [drm:drm_ioctl], ret = fffffff5
Sep  1 19:11:13 tippex [drm:drm_ioctl], pid=2376, cmd=0x40086485, nr=0x85, dev 0xe200, auth=1
Sep  1 19:11:13 tippex [drm:drm_ioctl], ret = fffffff5
Sep  1 19:11:13 tippex [drm:drm_ioctl], pid=2376, cmd=0x40086485, nr=0x85, dev 0xe200, auth=1
Sep  1 19:11:13 tippex [drm:drm_ioctl], ret = fffffff5
Sep  1 19:11:13 tippex [drm:drm_ioctl], pid=2376, cmd=0x40086485, nr=0x85, dev 0xe200, auth=1
Sep  1 19:11:13 tippex [drm:drm_ioctl], ret = fffffff5


Do you want the entire log from boot?

Comment 5 Pekka Paalanen 2009-09-01 11:53:52 UTC

(In reply to comment #4)
> Douh. Sorry about that. Removing the nvidiafb modules does make it better (and
> I get kms!!).

Cool.

> However, leaving the machine for the night locks the screensaver. Wiggling the
> mouse/kbd doesn't light up the screen. The machine continues to route packets
> though.

Which screensaver is that, does it run an animation?

> I get a bunch of these in the log:
> 
> Sep  1 19:11:13 tippex [drm:drm_ioctl], pid=2376, cmd=0x40086485, nr=0x85, dev
> 0xe200, auth=1
> Sep  1 19:11:13 tippex [drm:drm_ioctl], ret = fffffff5

This is still DRM_NOUVEAU_GEM_CPU_PREP returning EAGAIN. This is a sort of a soft lockup, where the graphics memory manager gets stuck. Well, assuming it gets stuck here, a bunch of those might appear even without a bug, I guess.

> Do you want the entire log from boot?

That would be nice, yes.
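
For readers unfamiliar with these log lines, the numbers can be unpacked by hand. The following is a minimal sketch, assuming the standard Linux ioctl request layout on x86 and Linux errno numbering; the helper names `decode_ioctl` and `decode_ret` are made up for illustration and are not part of any DRM tool. It does not try to map the number to an ioctl name, since that mapping depends on the nouveau DRM header in use.

```python
# Linux ioctl request layout on x86 (asm-generic/ioctl.h):
# bits 0-7 = nr, bits 8-15 = type, bits 16-29 = size, bits 30-31 = direction.
DRM_COMMAND_BASE = 0x40  # first ioctl number reserved for driver-private calls

# Linux x86 errno values relevant to this thread (hard-coded on purpose,
# so the result does not depend on the platform running the script).
LINUX_ERRNO = {11: "EAGAIN", 16: "EBUSY"}

DIRECTIONS = {0: "none", 1: "write", 2: "read", 3: "read/write"}


def decode_ioctl(cmd):
    """Split a raw ioctl request number into its encoded fields."""
    nr = cmd & 0xFF
    return {
        "nr": nr,
        "type": chr((cmd >> 8) & 0xFF),
        "size": (cmd >> 16) & 0x3FFF,
        "dir": DIRECTIONS[(cmd >> 30) & 0x3],
        # Offset into the driver-private range (negative for core DRM ioctls).
        "driver_index": nr - DRM_COMMAND_BASE,
    }


def decode_ret(hex_ret):
    """Interpret the logged 32-bit hex return value as a signed errno."""
    value = int(hex_ret, 16)
    if value >= 1 << 31:
        value -= 1 << 32
    name = LINUX_ERRNO.get(-value, "unknown") if value < 0 else "ok"
    return value, name


if __name__ == "__main__":
    print(decode_ioctl(0x40086485))
    print(decode_ret("fffffff5"))
```

Applied to the lines above, `cmd=0x40086485` is a write-direction ioctl of type 'd' carrying an 8-byte argument, and `ret = fffffff5` is -11, which is EAGAIN on Linux.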

Comment 6 aeriksson 2009-09-02 05:41:40 UTC

(In reply to comment #5)
> (In reply to comment #4)
> > Douh. Sorry about that. Removing the nvidiafb modules does make it better (and
> > I get kms!!).
> 
> Cool.
> 
> > However, leaving the machine for the night locks the screensaver. Wiggling the
> > mouse/kbd doesn't light up the screen. The machine continues to route packets
> > though.
> 
> Which screensaver is that, does it run an animation?
> 
The in-kernel one, I guess. The LED on the display goes to "sleep" mode and the display powers off. Waiting a long time (e.g. overnight) triggers it. I have DPMS set to sleep/off after 10 minutes, and these DPMS-induced sleeps can be undone by kbd/mouse. It's the long sleep sessions (hours) that tend to lock it up.

I have no traditional singing-dancing screensaver in use.



> > I get a bunch of these in the log:
> > 
> > Sep  1 19:11:13 tippex [drm:drm_ioctl], pid=2376, cmd=0x40086485, nr=0x85, dev
> > 0xe200, auth=1
> > Sep  1 19:11:13 tippex [drm:drm_ioctl], ret = fffffff5
> 
> This is still DRM_NOUVEAU_GEM_CPU_PREP returning EAGAIN. This is a sort of a
> soft lockup, where the graphics memory manager gets stuck. Well, assuming it
> gets stuck here, a bunch of those might appear even without a bug, I guess.
> 
I get about 100 roundtrips of them each second, and nothing else. Normal?



> > Do you want the entire log from boot?
> 
> That would be nice, yes.
> 
I'll post that tonight.
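
To put a number on "about 100 roundtrips each second", the syslog can be tallied per timestamp. This is only a sketch, assuming the line format shown in comment 4; the function name `ioctl_rate` and the regular expression are illustrative, not taken from any existing tool.

```python
import re
from collections import Counter

# Matches the request lines of the form quoted in comment 4, e.g.
# "Sep  1 19:11:13 tippex [drm:drm_ioctl], pid=2376, cmd=0x40086485, ..."
# The "ret = ..." lines carry no pid/cmd and are deliberately not matched.
IOCTL_LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+\S+\s+"
    r"\[drm:drm_ioctl\],\s+pid=\d+,\s+cmd=(?P<cmd>0x[0-9a-fA-F]+)"
)


def ioctl_rate(lines):
    """Count ioctl requests per (second, cmd) pair in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = IOCTL_LINE.match(line)
        if match:
            # Collapse the padded day field ("Sep  1") to single spaces.
            stamp = " ".join(match.group("stamp").split())
            counts[(stamp, match.group("cmd"))] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python ioctl_rate.py < messages.log
    for (stamp, cmd), n in ioctl_rate(sys.stdin).most_common(10):
        print(f"{stamp}  {cmd}  {n} calls")
```

A steady count of the same `cmd` every second, with nothing else in between, would match the polling loop described above.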

Comment 7 aeriksson 2009-09-02 14:13:38 UTC

Created attachment 29120
new log. boot to hang part1

The tool forces me to split up the log...

Comment 8 aeriksson 2009-09-02 14:14:26 UTC

Created attachment 29121
part2

Comment 9 aeriksson 2009-09-02 14:15:04 UTC

Created attachment 29122
part3

Comment 10 aeriksson 2009-09-02 14:15:39 UTC

Created attachment 29123
part4/4

Comment 11 Pekka Paalanen 2009-09-03 10:49:48 UTC

Argh. I looked only at the first part. The hang happens at 22:41:24.

First, that log is not from boot, or then the kernel message buffer has overflown FAST. (Which might not be a surprise, since it is logging crtc register access and whatnot.)

Browsing the log, it is clear that the kernel message buffer repeatedly overflows. Not to mention that that log is a bitch to search through.

Okay, I'll see if I could limit the crtc etc. logging in Nouveau during the weekend. The ioctl flood is a bit more annoying.


The kernel screen saver is just blanking and DPMS and mouse does not wake it up (not sure if with gpm it would). AFAIK X disables the kernel screen saver. Do you have X running, or just nouveaufb virtual terminal? I'm asking again, since the information from when nvidiafb was enabled is useless. Was this nv05?

You probably do have X running since all the activity in the log. What apps are running there? Could you stop them (to get smaller logs), does it hang then?

It would be useful to repeat the still valid basic information here that you mentioned in emails to the list. People reading this bug don't know we discussed it on the list.

Comment 12 aeriksson 2009-09-03 13:31:35 UTC

(In reply to comment #11)
> Argh. I looked only at the first part. The hang happens at 22:41:24.
> 
> First, that log is not from boot, or then the kernel message buffer has
> overflown FAST. (Which might not be a surprise, since it is logging crtc
> register access and whatnot.)
> 
I was surprised by the order of the initial stuff too.



> Browsing the log, it is clear that the kernel message buffer repeatedly
> overflows. Not to mention that that log is a bitch to search through.
> 
Any ideas how to get around the overflows?


> Okay, I'll see if I could limit the crtc etc. logging in Nouveau during the
> weekend. The ioctl flood is a bit more annoying.
> 
> 
ok

> The kernel screen saver is just blanking and DPMS and mouse does not wake it up
> (not sure if with gpm it would). AFAIK X disables the kernel screen saver. Do
> you have X running, or just nouveaufb virtual terminal?

I'm always in X when it hangs. I cannot recall a non-X hang. In vt mode, I use gpm, if that's relevant.

> I'm asking again, since
> the information from when nvidiafb was enabled is useless. Was this nv05?
>

lspci says:
01:00.0 VGA compatible controller: nVidia Corporation NV5 [RIVA TNT2/TNT2 Pro] (rev 15)
but the nouveau and/or drm code speaks of nv04 on probing...? I'll attach an Xorg log too.


> You probably do have X running since all the activity in the log.

yes.

> What apps are
> running there?

windowmaker, a couple of xterms, exmh (a tcl-tk app), sometimes firefox. I've seen hangs on x11 startup, at about the time exmh starts (ff starts manually).

> Could you stop them (to get smaller logs), does it hang then?

I'll let it run overnight with no wm and no apps, and we'll see what happens.


> 
> It would be useful to repeat the still valid basic information here that you
> mentioned in emails to the list. People reading this bug don't know we
> discussed it on the list.
> 
Right. See above. Some hangs are in display-on mode, in the middle of operations. The mouse moves, but all windows freeze. Other hangs are discovered e.g. after the night, when the display is off and won't come back on with mouse/kbd.

In almost all cases, I can ssh etc. to the machine just fine, and I've seen nothing interesting in any log apart from the EBUSY thing already talked about.

As a newbie to nouveau, I made one interesting observation: The X screen when starting X (even after warm-reboot) is initially painted with the contents of the last X screen. If I bog down the cpu with a cpu-bound task (such as drm logging to syslog) I can watch the old contents for ca 5 seconds before it's cleared and the WM starts. That smells like a BIG security issue for public terminals. Shouldn't the GPU/fb memory be cleared before being handed to the process (as RAM is when provided to a process)?

Comment 13 aeriksson 2009-09-03 13:34:11 UTC

Created attachment 29191
Xorg.0.log

Comment 14 Francisco Jerez 2009-09-04 10:26:37 UTC

(In reply to comment #12)
> lspci says:
> 01:00.0 VGA compatible controller: nVidia Corporation NV5 [RIVA TNT2/TNT2 Pro]
> (rev 15)
> but the nouveau and/or drm code speaks of nv04 on probing...? I'll attach an
> Xorg log too.
> 
> 

I've been trying to reproduce this for a while on another nv05 without success; I'm using master, though. You said you were on master-compat but later that you were using a 2.6.31 release candidate, so which branch are you using exactly?

There might be a specific acceleration request deterministically triggering the lockup. Have you tried e.g. "$ x11perf -all"? It would be interesting to know if any of the tests it does make your system hang.

> [...]
> As a newbie to nouveau, I made one interesting observation: The X screen when
> starting X (even after warm-reboot) is initially painted with the contents of
> the last X screen. If I bog down the cpu with a cpu bound task (such as drm
> logging to syslog) I can watch the old contents for ca 5 seconds before it's
> cleared and the WM starts. That smells like a BIG security issue for public
> terminals. Shouldn't the GPU/fb memory be cleared before handed to the process
> (as RAM is when provided to a process?)
> 

That could be easily done. In fact it's already done for the non-KMS case. I guess patches are welcome.

Comment 15 aeriksson 2009-09-05 05:52:27 UTC

Created attachment 29248
40 min boot to crash log

40 minutes run. Boot to hang.

The nouveau.ko module (and deps) was manually insmodded after boot, hence the initial rows in the log are more readable. There are still occasional overruns further down, though.

This time no screensaver was involved. The display was on for the entire run.

Comment 16 aeriksson 2009-09-05 10:35:21 UTC

> I've been trying to reproduce this for a while on another nv05 without success,
> I'm using master though. You said you were on master-compat but later that you
> were using a 2.6.31 release candidate so, which branch are you using exactly?
> 
vanilla 2.6.31-rc6 kernel

nouveau-drm-99999999
xf86-video-nouveau-9999
libdrm-9999

from gentoo's x11 overlay.

Looking inside the nouveau-drm-99999999 ebuild, I see that it pulls:
http://people.freedesktop.org/~pq/nouveau-drm/master-compat.tar.gz

The vanilla kernel has no nouveau stuff in it. The master-compat tarball builds fine against it. 


> There might be a specific acceleration request deterministically triggering
> the lockup. Have you tried e.g. "$ x11perf -all"? 
Running it now. I've seen one hang so far, but rebooting and running just that specific test failed to trigger it.

Just a thought... As the failure mode is "looping on ebusy", and it seems to be on the same (ioctl) call all the time, are there any debugging things which can be enabled for that particular call? printf'ing the reason for the ebusy might give some lead...


> 
> That could be easily done. In fact it's already done for the non-KMS case. I
> guess patches are welcome.
> 
It seems I'm being lured into actually looking at the code. :-) I might end up doing exactly that, but I cannot seem to find the time...

Comment 17 Francisco Jerez 2009-09-06 04:58:40 UTC

(In reply to comment #16)
> > I've been trying to reproduce this for a while on another nv05 without success,
> > I'm using master though. You said you were on master-compat but later that you
> > were using a 2.6.31 release candidate so, which branch are you using exactly?
> > 
> vanilla 2.6.31-rc6 kernel
> 
> nouveau-drm-99999999
> xf86-video-nouveau-9999
> libdrm-9999
> 
> from gentoo's x11 overlay.
> 
> Looking inside the nouveau-drm-99999999 ebuild, I see that it pulls:
> http://people.freedesktop.org/~pq/nouveau-drm/master-compat.tar.gz
> 
> The vanilla kernel has no nouveau stuff in it. The master-compat tarball builds
> fine against it. 
> 

This sounds like a master-compat issue then. Would you mind confirming that by installing the DRM from master?

Comment 18 aeriksson 2009-09-07 12:29:33 UTC

> This sounds like a master-compat issue then. Would you mind to confirm it is by
> installing the DRM from master?
> 

Running 2.6.31-rc6-g1889587 now. Let's see what happens.

Comment 19 aeriksson 2009-09-09 23:58:45 UTC

(In reply to comment #18)
> > This sounds like a master-compat issue then. Would you mind to confirm it is by
> > installing the DRM from master?
> > 
> 
> Running 2.6.31-rc6-g1889587 now. Let's see what happens.
> 

Same hangs with master, unfortunately. I'll try to get a short log and post it. The local pattern (loop on ebusy) is the same though.

Any chance this is caused by anything outside nouveau? Should I back down any of the other modules (libdrm, xf86-video-nouveau,...) to non-git versions?

Comment 20 aeriksson 2009-09-10 05:10:43 UTC

Created attachment 29381
hang log from master

Here's a log from a hang using 2.6.31-rc6-g1889587.

Effectively: start the kernel without nouveau, insmod it with modeset=1, run xinit, manually start firefox. It hangs while ff starts.

Comment 21 Francisco Jerez 2009-09-14 05:01:23 UTC

Created attachment 29519
fence_emit_race.patch

I think I've finally reproduced your problem (for some reason it seldom happens here, I guess you either have a faster card or a slower CPU). Does the attached patch help?
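
The patch itself is not reproduced in this thread, so the following is not the actual fix. It is only a generic toy model, with every name (`FakeChannel`, `emit_racy`, `emit_safe`, `cpu_prep`) invented for illustration, of how a fence-emission race could produce the symptom above: if the hardware can finish before the driver has recorded the fence as pending, the completion is missed and a waiter keeps seeing EAGAIN. It also shows why a relatively fast GPU or slow CPU would make such a race easier to hit.

```python
class FakeChannel:
    """Toy model of fence bookkeeping; not real nouveau code."""

    def __init__(self, instant_gpu):
        # instant_gpu=True models a GPU that finishes before the CPU
        # gets around to its own bookkeeping.
        self.instant_gpu = instant_gpu
        self.pending = []       # fences the driver expects to complete
        self.signalled = set()  # fences the driver has marked as done

    def _hw_submit(self, seq):
        # Hand the fence to the "hardware"; it may complete immediately.
        if self.instant_gpu:
            self.on_completion(seq)

    def on_completion(self, seq):
        # Only fences already on the pending list can be signalled.
        for fence in [f for f in self.pending if f <= seq]:
            self.pending.remove(fence)
            self.signalled.add(fence)

    def emit_racy(self, seq):
        self._hw_submit(seq)      # hardware may complete here...
        self.pending.append(seq)  # ...before the fence is tracked

    def emit_safe(self, seq):
        self.pending.append(seq)  # track the fence first
        self._hw_submit(seq)      # then let the hardware run

    def cpu_prep(self, seq):
        # A waiter polls until the fence is signalled.
        return "ok" if seq in self.signalled else "EAGAIN"


if __name__ == "__main__":
    racy = FakeChannel(instant_gpu=True)
    racy.emit_racy(1)
    print("racy order:", racy.cpu_prep(1))

    safe = FakeChannel(instant_gpu=True)
    safe.emit_safe(1)
    print("safe order:", safe.cpu_prep(1))
```

In this model the racy ordering only misbehaves when completion beats the bookkeeping; with a slow "GPU" both orderings work, which would explain a bug that is seldom seen on one machine and common on another.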

Comment 22 aeriksson 2009-09-15 23:43:15 UTC

(In reply to comment #21)
> Created an attachment (id=29519) [details]
> fence_emit_race.patch
> 
> I think I've finally reproduced your problem (for some reason it seldom happens
> here, I guess you either have a faster card or a slower CPU). Does the attached
> patch help?
> 

I applied the patch and let it sit in X overnight. The machine was still up when I checked it this morning. This is definitely a good sign. I'll keep you posted.

Machine details:
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 3
model name      : Pentium II (Klamath)
stepping        : 4
cpu MHz         : 300.664
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov mmx
bogomips        : 601.32
clflush size    : 32
power management:



01:00.0 VGA compatible controller: nVidia Corporation NV5 [RIVA TNT2/TNT2 Pro] (rev 15)

Definitely a slow machine by today's standards, but maybe the gpu is even slower??

Comment 23 Francisco Jerez 2009-09-16 05:30:57 UTC

(In reply to comment #22)
> I applied the patch and let it sit in X overnight. The machine was still up
> when I checked it this morning. This is definitely a good sign. I'll keep you
> posted.
> 
> Machine details:
> processor       : 0
> vendor_id       : GenuineIntel
> cpu family      : 6
> model           : 3
> model name      : Pentium II (Klamath)
> stepping        : 4
> cpu MHz         : 300.664
> cache size      : 512 KB
> fdiv_bug        : no
> hlt_bug         : no
> f00f_bug        : no
> coma_bug        : no
> fpu             : yes
> fpu_exception   : yes
> cpuid level     : 2
> wp              : yes
> flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
> mmx
> bogomips        : 601.32
> clflush size    : 32
> power management:
> 
Ok, so that's it.

> 
> 01:00.0 VGA compatible controller: nVidia Corporation NV5 [RIVA TNT2/TNT2 Pro]
> (rev 15)
> 
> Definitely a slow machine by today's standards, but maybe the gpu is even
> slower??
> 

Another annoying thing I realized:
> (II) NOUVEAU(0): Allocated 1MiB VRAM for offscreen pixmaps

FWIW you can just enable driver pixmaps if you want to take advantage of the whole VRAM.

Comment 24 aeriksson 2009-09-17 00:15:39 UTC

I'm happy to report that the machine is still up and running after another 24h session. I'm inclined to think that the bug is fixed. If it's still running by the end of the week, I'll close the bug.


(In reply to comment #23)
> (In reply to comment #22)
> > I applied the patch and let it sit in X overnight. The machine was still up
> > when I checked it this morning. This is definitely a good sign. I'll keep you
> > posted.
> > 
> > Machine details:
<snip>
> > bogomips        : 601.32
> > clflush size    : 32
> > power management:
> > 
> Ok, so that's it.
> 
Hmm. What do you mean? That this is a relatively fast machine? 


> Another annoying thing I realized:
> > (II) NOUVEAU(0): Allocated 1MiB VRAM for offscreen pixmaps
> 
> FWIW you can just enable driver pixmaps if you want take advantage of the whole
> VRAM.
> 
I'd be happy to do so. How? Searching the man pages & web gave me nothing.

Comment 25 Francisco Jerez 2009-09-17 03:09:51 UTC

(In reply to comment #24)
> (In reply to comment #23)
> > (In reply to comment #22)
> > > I applied the patch and let it sit in X overnight. The machine was still up
> > > when I checked it this morning. This is definitely a good sign. I'll keep you
> > > posted.
> > > 
> > > Machine details:
> <snip>
> > > bogomips        : 601.32
> > > clflush size    : 32
> > > power management:
> > > 
> > Ok, so that's it.
> > 
> Hmm. What do you mean? That this is a relatively fast machine? 
> 
I meant your odds are worse with such a slow machine :-)

> 
> > Another annoying thing I realized:
> > > (II) NOUVEAU(0): Allocated 1MiB VRAM for offscreen pixmaps
> > 
> > FWIW you can just enable driver pixmaps if you want take advantage of the whole
> > VRAM.
> > 
> I'd be happy to do so. How? Searching the man pages & web gave me nothing.
> 

You need a newer X server, at least a 1.7.0 release candidate. If you're using KMS, it's then enabled automatically.

Comment 26 Francisco Jerez 2009-09-17 04:07:09 UTC

*** Bug 23086 has been marked as a duplicate of this bug. ***

Comment 27 aeriksson 2009-09-23 11:25:10 UTC

Having tested the patch (and a recent git where the patch is included) for > 1 week without any issues, I believe the bug is fixed, so I close it.
