Summary: | [any/all(?) cards below nv50(?)] libdrm built with gcc4.8 causes crashes/lockups during normal operation, and resume from disk/memory | ||||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Product: | xorg | Reporter: | Ronald <ronald645> | ||||||||||||||||||
Component: | Driver/nouveau | Assignee: | Nouveau Project <nouveau> | ||||||||||||||||||
Status: | RESOLVED FIXED | QA Contact: | Xorg Project Team <xorg-team> | ||||||||||||||||||
Severity: | normal | ||||||||||||||||||||
Priority: | medium | CC: | masao-takahashi | ||||||||||||||||||
Version: | 7.7 (2012.06) | ||||||||||||||||||||
Hardware: | x86 (IA32) | ||||||||||||||||||||
OS: | Linux (All) | ||||||||||||||||||||
Whiteboard: | |||||||||||||||||||||
i915 platform: | i915 features: | ||||||||||||||||||||
Attachments: |
|
Created attachment 88463 [details]
Attempt to resume (nouveau.agpmode=4)
Suggested by imirkin. No dice.
Created attachment 88464 [details]
Full list of all upgraded packages
Woa, tried nvidia-173 on - 3.11 - 3.10 - 3.4 All failed to resume with a black screen. I guess nouveau used to have an edge on this one! So, no mmiotraces =( ... Using agp=off for the Linux 3.11 kernel and NvAGP=1 in xorg.conf it works with the blob again. Let me know if you would like to have a trace and all. https://bugs.launchpad.net/suse/+bug/255601 Hi Ronald AFAICS there are two routes you/we can take this * Find out which package update caused the breakage * Disable AGP with nouveau (similar to what you did with the blob) If you're going for the first one, here is a list of packages that you'll need to check (other may not me related): * xserver-xorg-core * xserver-xorg-video-nouveau * All *mesa* The above can be rather messy and hard, as you'll need to track that the dependencies are met and/or rebuild packages. The latter route is much easier * Append nouveau.agpmode=0 to your kernel command line (i.e. grub/lilo) Created attachment 88546 [details]
2_Attempt to resume (nouveau.agpmode=0)
Survives one cycle since I did this from SSH with `pm-hibernate`. Apparently Xorg only locksup on resume because of xscreensaver if I press the button in XFCE.
Created attachment 88547 [details]
2_Attempt to resume (agp=off)
Also tried agp=off
Created attachment 88548 [details]
2_Attempt to resume
To confirm my hypotheses that disabling AGP has no effect I tried it without kernel parameters that alter the driver's behaviour. Same effect.
(see above three logfiles) I tried your second suggestion, did not work(!). - nouveau.agpmode=0 -> crash - agp=off -> crash - no kernel parameters -> typing pm-hibernate from SSH -> success(!) . Clicking somewhere on desktop -> crash (Only moving cursor did not make it crash). I'm not totally against bisecting these source packages, but that will then be a longterm thing. Furthermore, is all of mesa required? (i.e. I'm not using a compositing desktop for that matter (just XFCE)). Conclusion, it appears that disabling agp has no effect. Machine only crashes after resume if the screen is repainted with the exception of mouse movements. (In reply to comment #6) > Created attachment 88546 [details] > 2_Attempt to resume (nouveau.agpmode=0) > > Survives one cycle since I did this from SSH with `pm-hibernate`. Apparently > Xorg only locksup on resume because of xscreensaver if I press the button in > XFCE. I had a strange feeling that mesa is involved :\ If you're up for it you can {re,}move nouveau_dri.so (the 3g/gl "driver") and confirm it that fixes your s2ram problems. Note your CPU usage may sky-rocket (the system will fall back to software GL rendering). (In reply to comment #9) [...] > I'm not totally against bisecting these source packages, but that will then > be a longterm thing. Furthermore, is all of mesa required? (i.e. I'm not > using a compositing desktop for that matter (just XFCE)). > Before bisecting anything I would suggest that you absolutely make sure that you know which package broke things - mesa, xserver-xorg-core, xserver-xorg-video-nouveau and/or other. When going through mesa, make sure that you keep all the packages at the same version. Here is a crash course of what the different mesa packages provide Essential parts * libgl1-mesa-glx -> libGL provider * libgl1-mesa-dri -> provides the "hardware accelerated drivers" * libglapi-mesa -> dispatch library Extra (your system may require one or both depending on what software you have) * libgbm1 * libegl1-mesa -> libEGL provider Other * libopenvg1-mesa -> libOpenVG, afaik noone uses this * mesa-common-dev -> headers Not part of mesa (but have mesa in the package name :P) * libglu1-mesa -> minor suspect * mesa-utils -> highly unlikely to be related Have fun :) I relocated /usr/lib/i386-linux-gnu/dri/nouveau_dri.so to nouveau_dri_off.so and restarted the computer with AGP. I confirmed the AIGLX error in that it switches to swrast. Did not fix the s2disk issue (dmesg attached). So, it's not mesa nor the kernel. All that remains is (just...) the X server, xf86-video-nouveau and libdrm. Right? Created attachment 88558 [details]
3_Attempt to resume (nouveau_dri.so disabled)
(In reply to comment #11) > I relocated /usr/lib/i386-linux-gnu/dri/nouveau_dri.so to nouveau_dri_off.so > and restarted the computer with AGP. I confirmed the AIGLX error in that it > switches to swrast. > > Did not fix the s2disk issue (dmesg attached). > > So, it's not mesa nor the kernel. All that remains is (just...) the X > server, xf86-video-nouveau and libdrm. Right? Sounds like it. Do you know which gcc version is/was used to compile the new libdrm ? Might it be related to this http://gcc.gnu.org/ml/gcc-help/2013-07/msg00103.html? 13.10 (saucy) is using 4.8. 4.7 seems available, rebuilding libdrm with the source package should be easy. Just need to modify CC var I think... Recompiled libdrm with gcc-4.7 and BOOM the issue did p00f... =) I'll be reporting this to Ubuntu upstream. Thanks a ton! Filed a bugreport here: https://bugs.launchpad.net/ubuntu/+source/libdrm/+bug/1247607 Ronald all the thanks goes to the Andreas Radke (Arch developer) for finding and bisecting the issue. FWIW Arch has been using clang for ~3 months to build libdrm and it seems to work great. Ideally one of the Ubuntu/GCC devs can take a look and provide more bit more clarity than "This hints at something depending on undefined evaluation order between sequence points" [1] Would that be an issue with nouveau code, gcc and/or both will hopefully be determined soon. Cheers Emil [1] http://gcc.gnu.org/ml/gcc-help/2013-07/msg00151.html Wow this is definitely a weird interesting bug. It affects suspend-to-mem on my pci-e nv43 too. <after resume from suspend to memory> [ 40.107810] nouveau E[ PGRAPH][0000:01:00.0] ERROR nsource: DATA_ERROR nstatus: BAD_ARGUMENT [ 40.108059] nouveau E[ PGRAPH][0000:01:00.0] ch 3 [0x000f7000 compiz[1713]] subc 2 class 0x0039 mthd 0x0314 data 0x0051f000 (error repeats) I can confirm that rebuilding only libdrm-nouveau2 with the raring toolchain fixes it. So, we have nv34, nv43 and nv44 affected by this. I updated the bugreport title so hopefully more people will stumble on it earlier (hope that is okay). - nv44: GCC thread: During normal operation. - nv43: Resuming from suspend-to-mem - nv34: Resuming from suspend-to-disk Created attachment 88817 [details] [review] hack to force defined behavior As I feared, recompiling only pushbuf.c with gcc-4.7 fixes this bug. So something in pushbuf.c is miscompiled. Unfortunately it's also the biggest abuser of post increment ops. A simple workaround seems to be this patch. Compiled code with this patch: 00000000000015b0 <nouveau_pushbuf_reloc>: 15b0: 55 push %rbp 15b1: 48 8b 6f 30 mov 0x30(%rdi),%rbp 15b5: 53 push %rbx 15b6: 48 89 fb mov %rdi,%rbx 15b9: e8 42 ea ff ff callq 0 <pushbuf_krel> 15be: 89 45 00 mov %eax,0x0(%rbp) 15c1: 48 83 43 30 04 addq $0x4,0x30(%rbx) 15c6: 5b pop %rbx 15c7: 5d pop %rbp 15c8: c3 retq Without: 0000000000001650 <nouveau_pushbuf_reloc>: 1650: 48 89 5c 24 f0 mov %rbx,-0x10(%rsp) 1655: 48 89 6c 24 f8 mov %rbp,-0x8(%rsp) 165a: 48 83 ec 10 sub $0x10,%rsp 165e: 48 8b 6f 30 mov 0x30(%rdi),%rbp 1662: 48 89 fb mov %rdi,%rbx 1665: e8 96 e9 ff ff callq 0 <pushbuf_krel> 166a: 89 45 00 mov %eax,0x0(%rbp) 166d: 48 83 c5 04 add $0x4,%rbp 1671: 48 89 6b 30 mov %rbp,0x30(%rbx) 1675: 48 8b 1c 24 mov (%rsp),%rbx 1679: 48 8b 6c 24 08 mov 0x8(%rsp),%rbp 167e: 48 83 c4 10 add $0x10,%rsp 1682: c3 retq It seems that the nouveau_pushbuf_reloc difference is harmless. Code is equivalent, except that without the patch applied it uses another temp variable causing less optimal code generation. Remaining candidates: pushbuf_validate, nouveau_pushbuf_data, nouveau_pushbuf_del. Oops, so gcc is broken here after all, look at this.. 0000000000003860 <nouveau_pushbuf_reloc>: 3860: 53 push %rbx 3861: 48 8b 5f 30 mov 0x30(%rdi),%rbx 3865: 48 8d 43 04 lea 0x4(%rbx),%rax 3869: 48 89 47 30 mov %rax,0x30(%rdi) 386d: e8 3e ea ff ff callq 22b0 <pushbuf_krel> 3872: 89 03 mov %eax,(%rbx) 3874: 5b pop %rbx 3875: c3 retq Source: void nouveau_pushbuf_reloc(struct nouveau_pushbuf *push, struct nouveau_bo *bo, uint32_t data, uint32_t flags, uint32_t vor, uint32_t tor) { *push->cur++ = pushbuf_krel(push, bo, data, flags, vor, tor); } gcc-4.8 evaluates it as: void nouveau_pushbuf_reloc(struct nouveau_pushbuf *push, struct nouveau_bo *bo, uint32_t data, uint32_t flags, uint32_t vor, uint32_t tor) { push->cur++; *push->cur[-1] = pushbuf_krel(push, bo, data, flags, vor, tor); } while gcc-4.7 evaluates it as: void nouveau_pushbuf_reloc(struct nouveau_pushbuf *push, struct nouveau_bo *bo, uint32_t data, uint32_t flags, uint32_t vor, uint32_t tor) { *push->cur = pushbuf_krel(push, bo, data, flags, vor, tor); push->cur++; } Because the function also increments the push->cur pointer, it's a subtle but real difference: gcc-4.7: u32 ret = func(); // May change push->cur ptr *push->cur = ret; push->cur++; gcc-4.8: u32 *ptr = push->cur; push->cur++; *ptr = func(); // Already sees the push->cur ptr I'm not a language expert, so no idea if I'm right, but it seems the updated gcc-4.8 behavior is wrong here. Well, pedantically gcc-4.7 behavior is this: u32 *ptr = push->cur; *ptr = func(); push->cur++; But since func only reads push->cur it's the same. I am pretty sure the order of execution is considered undefined behavior here in C, thus the nouveau code is wrong (it is a side effect). Check out chapter 2.12 in "The C Programming Language" if you have it. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.
Created attachment 88462 [details] Attempt to resume [ 0.197865] nouveau [ DEVICE][0000:01:00.0] Chipset: NV34 (NV34) [ 0.197869] nouveau [ DEVICE][0000:01:00.0] Family : NV30 After *only upgrading userspace* from Ubuntu 13.04 to Ubuntu 13.10 the card fails to properly resume from s2disk. Important upgraded packages: xserver-xorg-core:i386 2:1.13.3-0ubuntu6.2 -> 2:1.14.3-3ubuntu2 xserver-xorg-video-nouveau:i386 1:1.0.7-0ubuntu1 -> 1:1.0.9-2ubuntu1 libdrm2:i386 2.4.43-0ubuntu1.1 -> 2.4.46-1 libpciaccess0:i386 0.13.1-2 -> 0.13.2-1