Bug 71116 - [any/all(?) cards below nv50(?)] libdrm built with gcc4.8 causes crashes/lockups during normal operation, and resume from disk/memory
Summary: [any/all(?) cards below nv50(?)] libdrm built with gcc4.8 causes crashes/lock...
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/nouveau (show other bugs)
Version: 7.7 (2012.06)
Hardware: x86 (IA32) Linux (All)
: medium normal
Assignee: Nouveau Project
QA Contact: Xorg Project Team
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-11-01 08:40 UTC by Ronald
Modified: 2013-11-08 08:04 UTC (History)
1 user (show)

See Also:
i915 platform:
i915 features:


Attachments
Attempt to resume (33.44 KB, text/plain)
2013-11-01 08:40 UTC, Ronald
no flags Details
Attempt to resume (nouveau.agpmode=4) (49.39 KB, text/plain)
2013-11-01 08:41 UTC, Ronald
no flags Details
Full list of all upgraded packages (75.89 KB, text/plain)
2013-11-01 08:41 UTC, Ronald
no flags Details
2_Attempt to resume (nouveau.agpmode=0) (38.89 KB, text/plain)
2013-11-03 07:38 UTC, Ronald
no flags Details
2_Attempt to resume (agp=off) (46.16 KB, text/plain)
2013-11-03 07:39 UTC, Ronald
no flags Details
2_Attempt to resume (37.20 KB, text/plain)
2013-11-03 07:40 UTC, Ronald
no flags Details
3_Attempt to resume (nouveau_dri.so disabled) (36.65 KB, text/plain)
2013-11-03 13:29 UTC, Ronald
no flags Details
hack to force defined behavior (1.65 KB, patch)
2013-11-07 10:31 UTC, Maarten Lankhorst
no flags Details | Splinter Review

Description Ronald 2013-11-01 08:40:06 UTC
Created attachment 88462 [details]
Attempt to resume

[    0.197865] nouveau  [  DEVICE][0000:01:00.0] Chipset: NV34 (NV34)
[    0.197869] nouveau  [  DEVICE][0000:01:00.0] Family : NV30

After *only upgrading userspace* from Ubuntu 13.04 to Ubuntu 13.10 the card fails to properly resume from s2disk.

Important upgraded packages:

xserver-xorg-core:i386 2:1.13.3-0ubuntu6.2 -> 2:1.14.3-3ubuntu2
xserver-xorg-video-nouveau:i386 1:1.0.7-0ubuntu1 -> 1:1.0.9-2ubuntu1
libdrm2:i386 2.4.43-0ubuntu1.1 -> 2.4.46-1
libpciaccess0:i386 0.13.1-2 -> 0.13.2-1
Comment 1 Ronald 2013-11-01 08:41:07 UTC
Created attachment 88463 [details]
Attempt to resume (nouveau.agpmode=4)

Suggested by imirkin. No dice.
Comment 2 Ronald 2013-11-01 08:41:44 UTC
Created attachment 88464 [details]
Full list of all upgraded packages
Comment 3 Ronald 2013-11-02 14:59:07 UTC
Woa, tried nvidia-173 on

- 3.11
- 3.10
- 3.4

All failed to resume with a black screen. I guess nouveau used to have an edge on this one!

So, no mmiotraces =( ...
Comment 4 Ronald 2013-11-02 16:34:25 UTC
Using agp=off for the Linux 3.11 kernel and NvAGP=1 in xorg.conf it works with the blob again.

Let me know if you would like to have a trace and all.

https://bugs.launchpad.net/suse/+bug/255601
Comment 5 Emil Velikov 2013-11-02 19:12:59 UTC
Hi Ronald

AFAICS there are two routes you/we can take this
* Find out which package update caused the breakage
* Disable AGP with nouveau (similar to what you did with the blob)

If you're going for the first one, here is a list of packages that you'll need to check (other may not me related):
* xserver-xorg-core
* xserver-xorg-video-nouveau
* All *mesa*

The above can be rather messy and hard, as you'll need to track that the dependencies are met and/or rebuild packages.

The latter route is much easier
* Append nouveau.agpmode=0 to your kernel command line (i.e. grub/lilo)
Comment 6 Ronald 2013-11-03 07:38:42 UTC
Created attachment 88546 [details]
2_Attempt to resume (nouveau.agpmode=0)

Survives one cycle since I did this from SSH with `pm-hibernate`. Apparently Xorg only locksup on resume because of xscreensaver if I press the button in XFCE.
Comment 7 Ronald 2013-11-03 07:39:20 UTC
Created attachment 88547 [details]
2_Attempt to resume (agp=off)

Also tried agp=off
Comment 8 Ronald 2013-11-03 07:40:11 UTC
Created attachment 88548 [details]
2_Attempt to resume

To confirm my hypotheses that disabling AGP has no effect I tried it without kernel parameters that alter the driver's behaviour. Same effect.
Comment 9 Ronald 2013-11-03 07:40:35 UTC
(see above three logfiles)

I tried your second suggestion, did not work(!).

- nouveau.agpmode=0 -> crash
- agp=off -> crash
- no kernel parameters -> typing pm-hibernate from SSH -> success(!) . Clicking somewhere on desktop -> crash (Only moving cursor did not make it crash).

I'm not totally against bisecting these source packages, but that will then be a longterm thing. Furthermore, is all of mesa required? (i.e. I'm not using a compositing desktop for that matter (just XFCE)).

Conclusion, it appears that disabling agp has no effect. Machine only crashes after resume if the screen is repainted with the exception of mouse movements.
Comment 10 Emil Velikov 2013-11-03 11:49:15 UTC
(In reply to comment #6)
> Created attachment 88546 [details]
> 2_Attempt to resume (nouveau.agpmode=0)
> 
> Survives one cycle since I did this from SSH with `pm-hibernate`. Apparently
> Xorg only locksup on resume because of xscreensaver if I press the button in
> XFCE.
I had a strange feeling that mesa is involved :\

If you're up for it you can {re,}move nouveau_dri.so (the 3g/gl "driver") and confirm it that fixes your s2ram problems.
Note your CPU usage may sky-rocket (the system will fall back to software GL rendering).

(In reply to comment #9)
[...]
> I'm not totally against bisecting these source packages, but that will then
> be a longterm thing. Furthermore, is all of mesa required? (i.e. I'm not
> using a compositing desktop for that matter (just XFCE)).
> 
Before bisecting anything I would suggest that you absolutely make sure that you know which package broke things - mesa, xserver-xorg-core, xserver-xorg-video-nouveau and/or other.

When going through mesa, make sure that you keep all the packages at the same version.

Here is a crash course of what the different mesa packages provide
Essential parts
* libgl1-mesa-glx -> libGL provider
* libgl1-mesa-dri -> provides the "hardware accelerated drivers"
* libglapi-mesa -> dispatch library

Extra (your system may require one or both depending on what software you have)
* libgbm1
* libegl1-mesa -> libEGL provider

Other
* libopenvg1-mesa -> libOpenVG, afaik noone uses this
* mesa-common-dev -> headers

Not part of mesa (but have mesa in the package name :P)
* libglu1-mesa -> minor suspect
* mesa-utils -> highly unlikely to be related

Have fun :)
Comment 11 Ronald 2013-11-03 13:28:52 UTC
I relocated /usr/lib/i386-linux-gnu/dri/nouveau_dri.so to nouveau_dri_off.so and restarted the computer with AGP. I confirmed the AIGLX error in that it switches to swrast.

Did not fix the s2disk issue (dmesg attached).

So, it's not mesa nor the kernel. All that remains is (just...) the X server, xf86-video-nouveau and libdrm. Right?
Comment 12 Ronald 2013-11-03 13:29:21 UTC
Created attachment 88558 [details]
3_Attempt to resume (nouveau_dri.so disabled)
Comment 13 Emil Velikov 2013-11-03 13:59:11 UTC
(In reply to comment #11)
> I relocated /usr/lib/i386-linux-gnu/dri/nouveau_dri.so to nouveau_dri_off.so
> and restarted the computer with AGP. I confirmed the AIGLX error in that it
> switches to swrast.
> 
> Did not fix the s2disk issue (dmesg attached).
> 
> So, it's not mesa nor the kernel. All that remains is (just...) the X
> server, xf86-video-nouveau and libdrm. Right?

Sounds like it.
Do you know which gcc version is/was used to compile the new libdrm ? Might it be related to this http://gcc.gnu.org/ml/gcc-help/2013-07/msg00103.html?
Comment 14 Ronald 2013-11-03 15:10:53 UTC
13.10 (saucy) is using 4.8. 4.7 seems available, rebuilding libdrm with the source package should be easy. Just need to modify CC var I think...
Comment 15 Ronald 2013-11-03 15:49:42 UTC
Recompiled libdrm with gcc-4.7 and BOOM the issue did p00f... =)

I'll be reporting this to Ubuntu upstream. Thanks a ton!
Comment 16 Ronald 2013-11-03 15:57:49 UTC
Filed a bugreport here: https://bugs.launchpad.net/ubuntu/+source/libdrm/+bug/1247607
Comment 17 Emil Velikov 2013-11-03 16:16:07 UTC
Ronald all the thanks goes to the Andreas Radke (Arch developer) for finding and bisecting the issue. FWIW Arch has been using clang for ~3 months to build libdrm and it seems to work great.

Ideally one of the Ubuntu/GCC devs can take a look and provide more bit more clarity than
"This hints at something depending on undefined evaluation order between sequence points" [1]

Would that be an issue with nouveau code, gcc and/or both will hopefully be determined soon.

Cheers
Emil
[1] http://gcc.gnu.org/ml/gcc-help/2013-07/msg00151.html
Comment 18 Maarten Lankhorst 2013-11-07 09:58:44 UTC
Wow this is definitely a weird interesting bug.

It affects suspend-to-mem on my pci-e nv43 too.

<after resume from suspend to memory>
[   40.107810] nouveau E[  PGRAPH][0000:01:00.0]  ERROR nsource: DATA_ERROR nstatus: BAD_ARGUMENT
[   40.108059] nouveau E[  PGRAPH][0000:01:00.0] ch 3 [0x000f7000 compiz[1713]] subc 2 class 0x0039 mthd 0x0314 data 0x0051f000
(error repeats)

I can confirm that rebuilding only libdrm-nouveau2 with the raring toolchain fixes it.
Comment 19 Ronald 2013-11-07 10:04:27 UTC
So, we have nv34, nv43 and nv44 affected by this. I updated the bugreport title so hopefully more people will stumble on it earlier (hope that is okay).

- nv44: GCC thread: During normal operation.
- nv43: Resuming from suspend-to-mem
- nv34: Resuming from suspend-to-disk
Comment 20 Maarten Lankhorst 2013-11-07 10:31:40 UTC
Created attachment 88817 [details] [review]
hack to force defined behavior

As I feared, recompiling only pushbuf.c with gcc-4.7 fixes this bug. So something in pushbuf.c is miscompiled. Unfortunately it's also the biggest abuser of post increment ops.

A simple workaround seems to be this patch.
Comment 21 Maarten Lankhorst 2013-11-07 10:51:55 UTC
Compiled code with this patch:

00000000000015b0 <nouveau_pushbuf_reloc>:
    15b0:       55                      push   %rbp
    15b1:       48 8b 6f 30             mov    0x30(%rdi),%rbp
    15b5:       53                      push   %rbx
    15b6:       48 89 fb                mov    %rdi,%rbx
    15b9:       e8 42 ea ff ff          callq  0 <pushbuf_krel>
    15be:       89 45 00                mov    %eax,0x0(%rbp)
    15c1:       48 83 43 30 04          addq   $0x4,0x30(%rbx)
    15c6:       5b                      pop    %rbx
    15c7:       5d                      pop    %rbp
    15c8:       c3                      retq   

Without:
0000000000001650 <nouveau_pushbuf_reloc>:
    1650:       48 89 5c 24 f0          mov    %rbx,-0x10(%rsp)
    1655:       48 89 6c 24 f8          mov    %rbp,-0x8(%rsp)
    165a:       48 83 ec 10             sub    $0x10,%rsp
    165e:       48 8b 6f 30             mov    0x30(%rdi),%rbp
    1662:       48 89 fb                mov    %rdi,%rbx
    1665:       e8 96 e9 ff ff          callq  0 <pushbuf_krel>
    166a:       89 45 00                mov    %eax,0x0(%rbp)
    166d:       48 83 c5 04             add    $0x4,%rbp
    1671:       48 89 6b 30             mov    %rbp,0x30(%rbx)
    1675:       48 8b 1c 24             mov    (%rsp),%rbx
    1679:       48 8b 6c 24 08          mov    0x8(%rsp),%rbp
    167e:       48 83 c4 10             add    $0x10,%rsp
    1682:       c3                      retq   

It seems that the nouveau_pushbuf_reloc difference is harmless. Code is equivalent, except that without the patch applied it uses another temp variable causing less optimal code generation.

Remaining candidates: pushbuf_validate, nouveau_pushbuf_data, nouveau_pushbuf_del.
Comment 22 Maarten Lankhorst 2013-11-07 11:57:53 UTC
Oops, so gcc is broken here after all, look at this..

0000000000003860 <nouveau_pushbuf_reloc>:
    3860:       53                      push   %rbx
    3861:       48 8b 5f 30             mov    0x30(%rdi),%rbx
    3865:       48 8d 43 04             lea    0x4(%rbx),%rax
    3869:       48 89 47 30             mov    %rax,0x30(%rdi)
    386d:       e8 3e ea ff ff          callq  22b0 <pushbuf_krel>
    3872:       89 03                   mov    %eax,(%rbx)
    3874:       5b                      pop    %rbx
    3875:       c3                      retq   

Source:

void
nouveau_pushbuf_reloc(struct nouveau_pushbuf *push, struct nouveau_bo *bo,
		      uint32_t data, uint32_t flags, uint32_t vor, uint32_t tor)
{
	*push->cur++ = pushbuf_krel(push, bo, data, flags, vor, tor);
}

gcc-4.8 evaluates it as:

void
nouveau_pushbuf_reloc(struct nouveau_pushbuf *push, struct nouveau_bo *bo,
		      uint32_t data, uint32_t flags, uint32_t vor, uint32_t tor)
{
	push->cur++;
	*push->cur[-1] = pushbuf_krel(push, bo, data, flags, vor, tor);
}
Comment 23 Maarten Lankhorst 2013-11-07 11:59:11 UTC
while gcc-4.7 evaluates it as:

void
nouveau_pushbuf_reloc(struct nouveau_pushbuf *push, struct nouveau_bo *bo,
		      uint32_t data, uint32_t flags, uint32_t vor, uint32_t tor)
{
	*push->cur = pushbuf_krel(push, bo, data, flags, vor, tor);
	push->cur++;
}
Comment 24 Maarten Lankhorst 2013-11-07 12:07:01 UTC
Because the function also increments the push->cur pointer, it's a subtle but real difference:

gcc-4.7:

u32 ret = func(); // May change push->cur ptr
*push->cur = ret;
push->cur++;

gcc-4.8:
u32 *ptr = push->cur;
push->cur++;

*ptr = func(); // Already sees the push->cur ptr


I'm not a language expert, so no idea if I'm right, but it seems the updated gcc-4.8 behavior is wrong here.
Comment 25 Maarten Lankhorst 2013-11-07 12:24:40 UTC
Well, pedantically gcc-4.7 behavior is this:

u32 *ptr = push->cur;
*ptr = func();
push->cur++;

But since func only reads push->cur it's the same.
Comment 26 Kelly Doran 2013-11-07 12:57:02 UTC
I am pretty sure the order of execution is considered undefined behavior here in C, thus the nouveau code is wrong (it is a side effect).  Check out chapter 2.12 in "The C Programming Language" if you have it.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.