Bug 29583

Summary:

nouveau freezes Xorg on NV34

Product:

xorg

Reporter:

DarkRaven <drdarkraven>

Component:

Driver/nouveau

Assignee:

Nouveau Project <nouveau>

Status:

RESOLVED FIXED

QA Contact:

Xorg Project Team <xorg-team>

Severity:

normal

Priority:

highest

CC:

drdarkraven, hiyuh.root, mschiffer, w41ter

Version:

git

Hardware:

x86 (IA32)

OS:

Linux (All)

Attachments:

Description	Flags
Kernel log	none
Kernel configuration	none
Xorg log	none
Xorg log	none
Kernel log	none
More Kernel Log	none
nouveau_ttm_preemption.patch	none

Description DarkRaven 2010-08-15 06:45:44 UTC

The bug was first seen when resizing a 'fake transparent' rxvt window.
This action freezes Xorg.
After some further investigation, I discovered that the bug is reproducible by running:
 x11perf -compwinwin500
(It's fine when running x11perf -putimage500.)

Not only is Xorg frozen; it seems the whole system is frozen, too.
The cursor can't be moved, and the ssh connection is lost.

I'm using a snapshot kernel; the bug appeared after I updated my kernel tree on Aug 7.
There is no such bug in the nouveau kernel tree.

Here is my kernel log output.

Comment 1 DarkRaven 2010-08-15 06:46:30 UTC

Created attachment 37887 [details]
Kernel log

Comment 2 DarkRaven 2010-08-15 06:48:16 UTC

Created attachment 37888 [details]
Kernel configuration

Comment 3 DarkRaven 2010-08-17 01:35:47 UTC

Created attachment 37908 [details]
Xorg log

After '2.6.36-rc1' was merged into nouveau, the same problem appears with the nouveau kernel tree.

Here is my xorg.0.log

Comment 4 DarkRaven 2010-08-17 01:42:38 UTC

Created attachment 37909 [details]
Xorg log

Sorry, I submitted the wrong one.

Comment 5 DarkRaven 2010-08-23 23:40:07 UTC

Curious behavior of the nouveau driver (though I don't know if it's connected with this problem):
Once the nouveau driver is loaded, after the screen resolution changes (which is normal), there is an area, about 640*480 in size, at the top-left corner of the screen that is totally white.

This behavior started maybe after 2.6.36-rc1.

Comment 6 Francisco Jerez 2010-08-25 15:54:40 UTC

(In reply to comment #3)
> Created an attachment (id=37908) [details]
> Xorg log
> 
> After '2.6.36-rc1' was merged into nouveau, the same problem appears with the
> nouveau kernel tree.
> 
You could try to bisect this problem (see "man git-bisect").

> Here is my xorg.0.log

(In reply to comment #5)
> Curious behavior of the nouveau driver (though I don't know if it's connected
> with this problem):
> Once the nouveau driver is loaded, after the screen resolution changes (which is
> normal), there is an area, about 640*480 in size, at the top-left corner of the
> screen that is totally white.
> 
> This behavior started maybe after 2.6.36-rc1.

That's an unrelated issue; it's already fixed in Andrew Morton's -mm tree by the commit "vt: fix console corruption on driver hand-over".

Comment 7 DarkRaven 2010-08-25 22:26:27 UTC

Bisect result:

58374713c9dfb4d231f8c56cac089f6fbdedc2ec is the first bad commit
commit 58374713c9dfb4d231f8c56cac089f6fbdedc2ec
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Sat Jul 10 23:51:39 2010 +0200

    drm: kill BKL from common code
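
For anyone scripting around a bisect like this one, the result has a regular shape: a 40-character hash followed by "is the first bad commit". The following is only a minimal Python sketch that assumes this output format; the function name `first_bad_commit` is made up for illustration and is not part of git.

```python
import re
from typing import Optional

# Matches the summary line that `git bisect` prints when it finishes.
_FIRST_BAD = re.compile(r"^([0-9a-f]{40}) is the first bad commit$", re.MULTILINE)

def first_bad_commit(bisect_output: str) -> Optional[str]:
    """Return the hash named in the bisect summary line, or None if absent."""
    match = _FIRST_BAD.search(bisect_output)
    return match.group(1) if match else None

# The bisect result reported in this comment, used as sample input.
sample = """58374713c9dfb4d231f8c56cac089f6fbdedc2ec is the first bad commit
commit 58374713c9dfb4d231f8c56cac089f6fbdedc2ec
Author: Arnd Bergmann <arnd@arndb.de>
Date:   Sat Jul 10 23:51:39 2010 +0200

    drm: kill BKL from common code
"""
print(first_bad_commit(sample))
```

The hash extracted this way can then be passed to `git show` to inspect the offending change.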

Comment 8 Francisco Jerez 2010-08-26 02:26:01 UTC

(In reply to comment #7)
> Bisect result:
> 
> 58374713c9dfb4d231f8c56cac089f6fbdedc2ec is the first bad commit
> commit 58374713c9dfb4d231f8c56cac089f6fbdedc2ec
> Author: Arnd Bergmann <arnd@arndb.de>
> Date:   Sat Jul 10 23:51:39 2010 +0200
> 
>     drm: kill BKL from common code

Ben may have fixed this with e644bb2066cc46c01e5f8902bf840f19e1f942c6, could you try again with latest git?

Comment 9 DarkRaven 2010-08-26 03:01:47 UTC

Tried it; it doesn't fix the bug.

Comment 10 Ben Skeggs 2010-08-26 03:20:52 UTC

Yeah, I didn't expect my earlier fix to fix this or the other issues that have been reported.  It fixes another issue I encountered, however.  I have been unable to reproduce these so far, but my kernel has additional changes which could possibly affect things. I will try in more depth tomorrow.

Comment 11 Francisco Jerez 2010-08-27 05:06:27 UTC

You could try to build a kernel with CONFIG_LOCKUP_DETECTOR enabled. The lockup detector logs a backtrace when the kernel stays stuck for a minute or so. It may be useful along with a serial/netconsole to track this problem down.

Comment 12 DarkRaven 2010-08-27 05:19:59 UTC

The all-is-frozen situation described in the original description only appeared once.
During my bisecting, while Xorg was frozen, I was still able to move the mouse and the network was still working.
So according to the definition of 'Hardlockups':
  Hardlockups are bugs that cause the CPU to loop in kernel mode
  for more than 60 seconds, without letting other interrupts have a
  chance to run.
I don't think this is a hardlockup.

Comment 13 Francisco Jerez 2010-08-27 05:35:18 UTC

(In reply to comment #12)
> The all-is-frozen situation described in the original description only appeared once.
> During my bisecting, while Xorg was frozen, I was still able to move the mouse
> and the network was still working.
> So according to the definition of 'Hardlockups':
>   Hardlockups are bugs that cause the CPU to loop in kernel mode
>   for more than 60 seconds, without letting other interrupts have a
>   chance to run.
> I don't think this is a hardlockup.

Ah! Then CONFIG_DETECT_HUNG_TASK would be more appropriate.
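
Whether a given kernel build already has the detectors mentioned in this thread can be checked from its configuration file. The following is a minimal Python sketch, assuming the usual `.config` text format (`CONFIG_FOO=y` for enabled options, `# CONFIG_FOO is not set` for disabled ones); the helper name `enabled_options` and the sample text are hypothetical.

```python
def enabled_options(config_text: str) -> set:
    """Collect option names set to y or m in kernel .config text."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#"):
            continue
        name, _, value = line.partition("=")
        if value in ("y", "m"):
            enabled.add(name)
    return enabled

# Hypothetical excerpt; a real check would read the actual .config file.
sample_config = """\
CONFIG_LOCKUP_DETECTOR=y
# CONFIG_DETECT_HUNG_TASK is not set
CONFIG_DRM_NOUVEAU=m
"""
opts = enabled_options(sample_config)
for wanted in ("CONFIG_LOCKUP_DETECTOR", "CONFIG_DETECT_HUNG_TASK"):
    print(wanted, "enabled" if wanted in opts else "not enabled")
```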

Comment 14 Francisco Jerez 2010-08-27 06:15:21 UTC

(In reply to comment #13)
> (In reply to comment #12)
> > The all-is-frozen situation described in the original description only appeared once.
> > During my bisecting, while Xorg was frozen, I was still able to move the mouse
> > and the network was still working.
Actually, if it were some kind of kernel deadlock you wouldn't be able to move the mouse at all. Most likely the card has locked up for some reason. Can you reboot with "drm.debug=3 log_buf_len=256k" in the kernel command line and attach new kernel logs after the hang?

> > So according to the definition of 'Hardlockups':
> >   Hardlockups are bugs that cause the CPU to loop in kernel mode
> >   for more than 60 seconds, without letting other interrupts have a
> >   chance to run.
> > I don't think this is a hardlockup.
> 
> Ah! Then CONFIG_DETECT_HUNG_TASK would be more appropriate.
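
After rebooting, it can be worth confirming that the two boot parameters requested in this comment actually took effect. The following is a rough Python sketch that splits a kernel command line into parameters; the helper name `parse_cmdline` and the sample string are assumptions for illustration, and quoted values containing spaces are not handled.

```python
def parse_cmdline(cmdline: str) -> dict:
    """Split a kernel command line into {parameter: value}; bare flags map to None."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params

# Hypothetical command line; on a live system the text would come from /proc/cmdline.
sample = "root=/dev/sda1 ro quiet drm.debug=3 log_buf_len=256k"
params = parse_cmdline(sample)
print(params.get("drm.debug"), params.get("log_buf_len"))
```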

Comment 15 DarkRaven 2010-08-27 06:37:03 UTC

Created attachment 38220 [details]
Kernel log

Only attached part of it; it should be enough.
I don't think you would need the full 247k dmesg output.

Comment 16 DarkRaven 2010-08-27 06:39:24 UTC

BTW, X is in the R state while hanging, and top shows sys taking all the CPU time.

Comment 17 DarkRaven 2010-08-27 06:51:32 UTC

FYI, commit 58374713c9dfb4d231f8c56cac089f6fbdedc2ec changed the lock_kernel() and unlock_kernel() calls before and after the func() call to mutex_lock() and mutex_unlock().
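
The shape of that change can be pictured with a userspace analogy. The following Python sketch is not kernel code and does not reproduce the commit: it only shows a handler being called with one global mutex held, and the names `global_mutex`, `dispatch` and `example_handler` are invented for illustration. A sleeping mutex like this serializes callers but does not stop other threads from being scheduled while it is held.

```python
import threading

# Stand-in for a single global mutex; the name is hypothetical.
global_mutex = threading.Lock()

def dispatch(func, *args):
    """Call func with the global mutex held, releasing it afterwards."""
    with global_mutex:  # analogous to mutex_lock() ... mutex_unlock()
        return func(*args)

def example_handler(x):
    # While a handler runs, the mutex is held by the calling thread.
    assert global_mutex.locked()
    return x * 2

print(dispatch(example_handler, 21))
```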

Comment 18 Francisco Jerez 2010-08-27 08:02:38 UTC

(In reply to comment #15)
> Created an attachment (id=38220) [details]
> Kernel log
> 
> Only attached part of it; it should be enough.
> I don't think you would need the full 247k dmesg output.

It looks like fallout from an earlier problem, so yeah, full kernel logs would be interesting.

Comment 19 DarkRaven 2010-08-27 08:09:27 UTC

Created attachment 38224 [details]
More Kernel Log

Comment 20 Francisco Jerez 2010-08-28 04:31:23 UTC

(In reply to comment #19)
> Created an attachment (id=38224) [details]
> More Kernel Log
> [...]
>
> ----Many similar lines----
>
So you claim there are no errors there? Please, provide *full* kernel logs.

BTW, can you reproduce this with the 3D drivers uninstalled?

Comment 21 DarkRaven 2010-08-28 05:25:15 UTC

Full kernel log here (too large for bugzilla):
http://pastebin.com/5TKurz6u

And I can reproduce it without the 3D driver (completely without 3D: nouveau_dri, libGL & mesa uninstalled).

Comment 22 DarkRaven 2010-08-28 07:26:30 UTC

BTW, I won't have a chance to use this computer (with the NV34 graphics card) for the coming month.

Comment 23 Francisco Jerez 2010-08-28 08:33:50 UTC

Created attachment 38240 [details] [review]
nouveau_ttm_preemption.patch

(In reply to comment #22)
> BTW, I won't have a chance to use this computer (with the NV34 graphics card)
> for the coming month.

Most likely this is a race between the X server thread and the TTM delayed work queue. It couldn't happen before because taking the BKL disables preemption. Are you still in time to test patches?

Comment 24 DarkRaven 2010-08-28 16:01:27 UTC

(In reply to comment #23)
> Created an attachment (id=38240) [details]
> nouveau_ttm_preemption.patch
> 
> (In reply to comment #22)
> > BTW, I won't have a chance to use this computer (with the NV34 graphics card)
> > for the coming month.
> 
> Most likely this is a race between the X server thread and the TTM delayed work
> queue. It couldn't happen before because taking the BKL disables preemption.
> Are you still in time to test patches?

Sorry, can't.

Comment 25 walt 2010-08-28 16:09:42 UTC

I've been having similar freezes with NV34 while using very recent kernels, so I would be very happy to test patches.

I'm using the nouveau support in Linus.git kernel, though, not nouveau.git.  I could start using nouveau.git if that's what you'd be creating patches from.

Comment 26 Francisco Jerez 2010-08-28 18:23:41 UTC

(In reply to comment #25)
> I've been having similar freezes with NV34 while using very recent kernels, so
> I would be very happy to test patches.
> 
> I'm using the nouveau support in Linus.git kernel, though, not nouveau.git.  I
> could start using nouveau.git if that's what you'd be creating patches from.

Yeah, you could apply it over Linus' tree, if you want, but it has already been reported to solve the problem.

Comment 27 walt 2010-08-29 10:59:37 UTC

Today's kernel from Linus seems to fix the problem for me too, fingers crossed.

Comment 28 Francisco Jerez 2010-08-30 11:53:49 UTC

I've pushed the fix to master, closing.

Comment 29 Francisco Jerez 2010-08-30 11:59:41 UTC

*** Bug 29809 has been marked as a duplicate of this bug. ***

Comment 30 walt 2010-08-31 19:46:57 UTC

(In reply to comment #28)
> I've pushed the fix to master, closing.

The bug may be closed, but the dumb questions linger on forever :p

Yesterday I noticed a very similar hang on my NV4 machine (which I didn't see before yesterday).

Does it make sense that this same bug would also affect an NV4 chipset?  Or should I start thinking about filing a different bug report?

Many thanks!

Comment 31 Francisco Jerez 2010-09-01 05:05:37 UTC

(In reply to comment #30)
> (In reply to comment #28)
> > I've pushed the fix to master, closing.
> 
> The bug may be closed, but the dumb questions linger on forever :p
> 
> Yesterday I noticed a very similar hang on my NV4 machine (which I didn't see
> before yesterday).
> 
> Does it make sense that this same bug would also affect an NV4 chipset?  Or
> should I start thinking about filing a different bug report?
> 
Yeah, it's probably the same issue; this bug affected the whole card range.

> Many thanks!
