Created attachment 100709 [details] Htop sreenshot All tested mozilla apps (firefox, thunderbird and seamonkey) freeze on startup with nouveau-dri-10.2.1 libraries. If nouveau-dri libraries downgraded to 10.1.4, freezes gone, even without restarting x-server. Freeze localized in forked mozilla process (with ~100% cpu eating), killing this process can be restore mozilla's startup. This bug detected only in nouveau dual-screen configuration. Intel dual-screen works nornally, nouveau single-screen works too. Arch linux, i686 Linux kernels: 3.14.6-1 and 3.10.42-1 (LTS) Hardware: GeForce FX5200 /etc/X11/xorg.conf.d/21-nouveau.conf : Section "ServerLayout" Identifier "dual" Screen 0 "Screen0" Screen 1 "Screen1" RightOf "Screen0" EndSection Section "Monitor" Identifier "Monitor0" VendorName "Unknown" ModelName "Plug-n-play" DisplaySize 323 244 HorizSync 30.0 - 69.0 VertRefresh 50.0 - 160.0 Option "DPMS" EndSection Section "Monitor" Identifier "Monitor1" VendorName "Unknown" ModelName "TV-0" DisplaySize 200 150 EndSection Section "Device" Identifier "Videocard0" Driver "nouveau" BusID "PCI:1:0:0" Screen 0 Option "ZaphodHeads" "VGA-1" EndSection Section "Device" Identifier "Videocard1" Driver "nouveau" BusID "PCI:1:0:0" Screen 1 Option "ZaphodHeads" "TV-1" EndSection Section "Screen" Identifier "Screen0" Device "Videocard0" Monitor "Monitor0" DefaultDepth 24 SubSection "Display" Depth 24 Modes "1024x768_85" EndSubSection EndSection Section "Screen" Identifier "Screen1" Device "Videocard1" Monitor "Monitor1" DefaultDepth 24 SubSection "Display" Depth 24 Modes "720x576" EndSubSection EndSection
There haven't been too many changes between 10.1 and 10.2 to the nv30 driver. Mind doing a bisect? It seems reasonable to restrict it to src/gallium/drivers/nouveau/nv30. By the way, why are you using zaphod heads + 2 separate screens (and no xinerama I assume) instead of randr and having it all in one screen and dynamically configurable.
(In reply to comment #1) > There haven't been too many changes between 10.1 and 10.2 to the nv30 > driver. Mind doing a bisect? It seems reasonable to restrict it to > src/gallium/drivers/nouveau/nv30. I've never done a bisect before, but I have an strace log from mozilla start – with and without freeze. The freeze log is so big (9MB), but if it helps, I will post it tomorrow. > By the way, why are you using zaphod heads + 2 separate screens (and no > xinerama I assume) instead of randr and having it all in one screen and > dynamically configurable. I don't need a joint dynamic screen for two monitors. The second strictly dedicated screen used only for video playback on TV, and the first regular screen on monitor used for everything else.
(In reply to comment #2) > (In reply to comment #1) > > There haven't been too many changes between 10.1 and 10.2 to the nv30 > > driver. Mind doing a bisect? It seems reasonable to restrict it to > > src/gallium/drivers/nouveau/nv30. > I've never done a bisect before, but I have an strace log from mozilla start > – with and without freeze. The freeze log is so big (9MB), but if it helps, > I will post it tomorrow. strace log is not that useful. It's probably sitting around waiting for something to happen, which never happens. Bisecting is fairly easy... there are a bunch of guides online. I'd recommend installing into some scratch location and using LD_LIBRARY_PATH to pick up your new lib, so that you don't mess up your system install.
(In reply to comment #3) > (In reply to comment #2) > > (In reply to comment #1) > > > There haven't been too many changes between 10.1 and 10.2 to the nv30 > > > driver. Mind doing a bisect? It seems reasonable to restrict it to > > > src/gallium/drivers/nouveau/nv30. > > I've never done a bisect before, but I have an strace log from mozilla start > > – with and without freeze. The freeze log is so big (9MB), but if it helps, > > I will post it tomorrow. > > strace log is not that useful. It's probably sitting around waiting for > something to happen, which never happens. Bisecting is fairly easy... there > are a bunch of guides online. I'd recommend installing into some scratch > location and using LD_LIBRARY_PATH to pick up your new lib, so that you > don't mess up your system install. Oh, and you can find some build instructions at http://nouveau.freedesktop.org/wiki/InstallNouveau/#mesa
(In reply to comment #3) > strace log is not that useful. It's probably sitting around waiting for > something to happen, which never happens. Bisecting is fairly easy... there > are a bunch of guides online. I'd recommend installing into some scratch > location and using LD_LIBRARY_PATH to pick up your new lib, so that you > don't mess up your system install. It's not exactly "sitting around and waiting", it cycling around (up to 100% CPU used) and keeps repeating the same action, probably with nouveau-dri library. (In reply to comment #4) > Oh, and you can find some build instructions at > http://nouveau.freedesktop.org/wiki/InstallNouveau/#mesa Thank you, I will try to do it.
From freeze strace log: # Device opened: open("/dev/dri/card0", O_RDWR|O_CLOEXEC) = 5 ... close(5) = 0 # device /dev/dri/card0 closed, but trying to use already closed descriptor: fstat64(5, 0xbfa97f60) = -1 EBADF (Bad file descriptor) ioctl(5, 0xc0406481, 0xbfa97fc0) = -1 EBADF (Bad file descriptor) write(2, "nouveau: kernel rejected pushbuf"..., 54) = 54 write(2, "nouveau: ch0: krec 0 pushes 1 bu"..., 46) = 46 write(2, "nouveau: ch0: buf 00000000 00000"..., 63) = 63 write(2, "nouveau: ch0: psh 00000000 00000"..., 49) = 49 write(2, "nouveau: \t0x0008fd6c\n", 21) = 21 write(2, "nouveau: \t0x00000000\n", 21) = 21 write(2, "nouveau: \t0x00000003\n", 21) = 21 write(2, "nouveau: \t0x00020000\n", 21) = 21 write(2, "nouveau: \t0x00000000\n", 21) = 21 # Cycling infinitely sched_yield() = 0 sched_yield() = 0 sched_yield() = 0 ...
(In reply to comment #6) > From freeze strace log: > > # Device opened: > open("/dev/dri/card0", O_RDWR|O_CLOEXEC) = 5 > ... > close(5) = 0 > # device /dev/dri/card0 closed, but trying to use already closed descriptor: > fstat64(5, 0xbfa97f60) = -1 EBADF (Bad file descriptor) > ioctl(5, 0xc0406481, 0xbfa97fc0) = -1 EBADF (Bad file descriptor) > write(2, "nouveau: kernel rejected pushbuf"..., 54) = 54 Curious. Do you have a dmesg from that? And the full output that was written out to stderr? A bisect will still be the most effective way of tracking down the issue.
(In reply to comment #7) > Curious. Do you have a dmesg from that? And the full output that was written > out to stderr? No messages in any system log at this time. No stderr output – a library sends output to descriptor #2, but mozilla was opened it as /dev/null. > A bisect will still be the most effective way of tracking down the issue. Bisecting in progress, 2 steps from 10.
Bisecting not finished yet, but I think about that: http://cgit.freedesktop.org/mesa/mesa/commit/?id=507f0230d4ca2238c818006499e21abb4c133203
Bisect complete, I was right. 507f0230d4ca2238c818006499e21abb4c133203 is the first bad commit commit 507f0230d4ca2238c818006499e21abb4c133203 Author: Ilia Mirkin <imirkin@alum.mit.edu> Date: Wed Mar 5 22:25:55 2014 -0500 nouveau: fix fence waiting logic in screen destroy nouveau_fence_wait has the expectation that an external entity is holding onto the fence being waited on, not that it is merely held onto by the current pointer. Fixes a use-after-free in nouveau_fence_wait when used on the screen's current fence. Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=75279 Signed-off-by: Ilia Mirkin <imirkin@alum.mit.edu> Reviewed-by: Christoph Bumiller <e0425955@student.tuwien.ac.at> Cc: "9.2 10.0 10.1" <mesa-stable@lists.freedesktop.org> :040000 040000 46ad31dd95db1695a8c91fffbf38ee7822d8a31f cdf816d872018460b33f69949010f09dcdcbf38e M src
Hm, that's funny. You said 10.1.4 worked, which contains that commit... I suspected that commit as well until I saw that you said 10.1.4 was fine. I'll think about it.
(In reply to comment #11) > Hm, that's funny. You said 10.1.4 worked, which contains that commit... I > suspected that commit as well until I saw that you said 10.1.4 was fine. > I'll think about it. I said 10.1.4 WAS fine, already builded (21.06.2014) and downloaded from Arch Linux repository. But in current building environment, all revisions since bisected commit seems buggy on nv30. I will try to revert this patch on tomorrow.
(In reply to comment #12) > (In reply to comment #11) > > Hm, that's funny. You said 10.1.4 worked, which contains that commit... I > > suspected that commit as well until I saw that you said 10.1.4 was fine. > > I'll think about it. > > I said 10.1.4 WAS fine, already builded (21.06.2014) and downloaded from > Arch Linux repository. Yes, that was my implication as well. > But in current building environment, all revisions since bisected commit > seems buggy on nv30. > > I will try to revert this patch on tomorrow. There's probably some additional interaction with a patch that hadn't made it to 10.1... perhaps the screen sharing change to better support NV_vdpau_interop, but I don't really see how that could cause issues either.
The version 10.2.1 tested: no freeze with reverted patch for nv30_screen.c
Bug not only with mozilla, but glxinfo: -- Version 10.2.1: freeze afrer end of (normal) output, and print this to stderr: nouveau: kernel rejected pushbuf: Bad file descriptor nouveau: ch0: krec 0 pushes 1 bufs 1 relocs 0 nouveau: ch0: buf 00000000 00000002 00000004 00000004 00000000 nouveau: ch0: psh 00000000 0000000140 0000000154 nouveau: 0x0008fd6c nouveau: 0x00000000 nouveau: 0x00000004 nouveau: 0x00020000 nouveau: 0x00000000 -- Version 10.2.1 (reverted patch): seems to bee ok. -- Version 10.1.4 (from repository): segfault after (normal) output, in dmesg: glxinfo[9715]: segfault at 88 ip b6d88829 sp bffa4420 error 4 in nouveau_dri.so[b6a37000+64a000] -- Last revision before bisected commit: seems to bee ok. -- Bisected commit: freeze, stderr: nouveau: kernel rejected pushbuf: Bad file descriptor nouveau: ch0: krec 0 pushes 1 bufs 1 relocs 0 nouveau: ch0: buf 00000000 00000002 00000004 00000004 00000000 nouveau: ch0: psh 00000000 0000000140 0000000154 nouveau: 0x0008fd6c nouveau: 0x00000000 nouveau: 0x00000004 nouveau: 0x00020000 nouveau: 0x00000000
glxtest (from bug #94465): -- Version 10.1.4 (from repository): VENDOR nouveau RENDERER Gallium 0.4 on NV34 VERSION 1.5 Mesa 10.1.4 TFP TRUE -- Version 10.2.1: nouveau: kernel rejected pushbuf: Bad file descriptor nouveau: ch0: krec 0 pushes 1 bufs 1 relocs 0 nouveau: ch0: buf 00000000 00000002 00000004 00000004 00000000 nouveau: ch0: psh 00000000 000000012c 0000000140 nouveau: 0x0008fd6c nouveau: 0x00000000 nouveau: 0x00000003 nouveau: 0x00020000 nouveau: 0x00000000 (freeze) -- Version 10.2.1 (reverted patch): VENDOR nouveau RENDERER Gallium 0.4 on NV34 VERSION 1.5 Mesa 10.2.1 TFP TRUE -- Last revision before bisected commit: VENDOR nouveau RENDERER Gallium 0.4 on NV34 VERSION 1.5 Mesa 10.2.0-devel (git-5bf90cb) TFP TRUE -- Bisected commit: nouveau: kernel rejected pushbuf: Bad file descriptor nouveau: ch0: krec 0 pushes 1 bufs 1 relocs 0 nouveau: ch0: buf 00000000 00000002 00000004 00000004 00000000 nouveau: ch0: psh 00000000 000000012c 0000000140 nouveau: 0x0008fd6c nouveau: 0x00000000 nouveau: 0x00000003 nouveau: 0x00020000 nouveau: 0x00000000 (freeze)
Sorry, glxtest is from bug #75279, of course.
What happens if you add screen->base.pushbuf->user_priv = NULL right before the if (screen->base.fence.current) check in nv30_screen_destroy?
(In reply to comment #18) > What happens if you add > > screen->base.pushbuf->user_priv = NULL > > right before the if (screen->base.fence.current) check in > nv30_screen_destroy? No effect. After llvm,llvm-libs update 3.4.1 -> 3.4.2 also glxgears starting to freeze instead of normal exit by ESC: Running synchronized to the vertical refresh. The framerate should be approximately the same as the monitor refresh rate. 2469 frames in 5.0 seconds = 493.688 FPS 2476 frames in 5.0 seconds = 495.177 FPS nouveau: kernel rejected pushbuf: Bad file descriptor nouveau: ch0: krec 0 pushes 1 bufs 1 relocs 0 nouveau: ch0: buf 00000000 00000004 00000004 00000004 00000000 nouveau: ch0: psh 00000000 0000025e84 0000025e98 nouveau: 0x0008fd6c nouveau: 0x00000000 nouveau: 0x00001437 nouveau: 0x00020000 nouveau: 0x00000000 (freeze)
(In reply to comment #19) > (In reply to comment #18) > > What happens if you add > > > > screen->base.pushbuf->user_priv = NULL > > > > right before the if (screen->base.fence.current) check in > > nv30_screen_destroy? > > No effect. Well, I've been unable to reproduce the "kernel rejected pushbuf" + hang behaviour, although I've only been testing with a NV43. However I did observe the issue that I thought I would observe due to that line not being there... here's a slightly more correct version than my previous suggestion: http://patchwork.freedesktop.org/patch/28050/ And another patch which I think is unlikely to affect what you're seeing, but who knows: http://patchwork.freedesktop.org/patch/28049/ Assuming these don't help, could you run that glxtest application inside valgrind, and let me know if you see any illegal accesses. (Note that the first one of those patches will in fact prevent one such illegal access.)
Both patches are not works for me. See valgrind log in attachment.
Created attachment 101283 [details] Valgrind glxtest logs Valgrind glxtest logs: 10.2.1 (clear); 10.2.1-reverted (without bisectred patch); 10.2.1-bufctx (avoid dangling... patch); 10.2.1-leaks (plug some memory leaks... patch); 10.2.1-bufctx_leaks (both patches)
OK, well, I can't repro this, but clearly the device is getting closed _somehow_. It appears that you're relatively able at debugging things, let me know if I've misunderstood. Can you load up glxtest in gdb (with my fixes), and do break main r break close break nv30_screen_destroy c and look at what happens. The first couple of closes will actually be at creation time because it opens/closes some config files, continue past those. This is what I see: (gdb) c Continuing. Breakpoint 2, nv50_screen_destroy (pscreen=0x633c60) at nv50/nv50_screen.c:295 295 struct nv50_screen *screen = nv50_screen(pscreen); (gdb) n 297 if (!nouveau_drm_screen_unref(&screen->base)) (gdb) p screen->base.device->fd $1 = 8 (gdb) c Continuing. Breakpoint 3, 0x00007ffff6db23d0 in close () from /lib64/libc.so.6 (gdb) bt #0 0x00007ffff6db23d0 in close () from /lib64/libc.so.6 #1 0x00007ffff643797a in dri2DestroyScreen (base=0x618c50) at dri2_glx.c:712 #2 0x00007ffff63fcdb5 in FreeScreenConfigs (priv=0x615600) at glxext.c:208 #3 0x00007ffff63fce3f in glx_display_free (priv=0x615600) at glxext.c:231 #4 0x00007ffff63fd01a in __glXCloseDisplay (dpy=0x6070f0, codes=0x615688) at glxext.c:279 #5 0x00007ffff78bc0c2 in XCloseDisplay () from /usr/lib64/libX11.so.6 #6 0x00000000004011be in glxtest () at glxtest.cpp:202 #7 0x000000000040120a in main (argc=1, argv=0x7fffffffdda8) at glxtest.cpp:211 (gdb) up #1 0x00007ffff643797a in dri2DestroyScreen (base=0x618c50) at dri2_glx.c:712 712 close(psc->fd); (gdb) p psc->fd $2 = 8 (gdb) c Continuing. Breakpoint 3, 0x00007ffff6db23d0 in close () from /lib64/libc.so.6 (gdb) bt #0 0x00007ffff6db23d0 in close () from /lib64/libc.so.6 #1 0x00007ffff6abddee in xcb_disconnect () from /usr/lib64/libxcb.so.1 #2 0x00007ffff78bc0f7 in XCloseDisplay () from /usr/lib64/libX11.so.6 #3 0x00000000004011be in glxtest () at glxtest.cpp:202 #4 0x000000000040120a in main (argc=1, argv=0x7fffffffdda8) at glxtest.cpp:211 (gdb) c [this is actually with nv50, but the nv30 flow should be no different] You must be getting something else, otherwise how is that fd getting closed already.
Created attachment 101333 [details] GDB log witn reverted patch GDB log witn reverted (bisected) patch
Created attachment 101336 [details] GDB log witn clear mesa-10.2.1 Before close: $ ls -l /proc/7967/fd/ total 0 lrwx------ 1 user users 64 Jun 19 11:41 0 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 11:41 1 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 11:41 2 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 11:41 3 -> socket:[899028] lrwx------ 1 user users 64 Jun 19 11:41 4 -> /dev/dri/card0 lrwx------ 1 user users 64 Jun 19 11:41 5 -> /dev/dri/card0 after close: $ ls -l /proc/7967/fd/ total 0 lrwx------ 1 user users 64 Jun 19 11:41 0 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 11:41 1 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 11:41 2 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 11:41 3 -> socket:[899028] lrwx------ 1 user users 64 Jun 19 11:41 5 -> /dev/dri/card0
Created attachment 101338 [details] GDB log with "avoid dangling..." patch /dev/dri/card0 on fd #4 closed, /dev/dri/card0 on fd #5 remained open.
(In reply to comment #26) > Created attachment 101338 [details] > GDB log with "avoid dangling..." patch > > /dev/dri/card0 on fd #4 closed, > /dev/dri/card0 on fd #5 remained open. So... for some *unknown* reason, the thing decides to create two screens. Fine. Now we have this thing which reuses screens for the same device... makes sense. But after it tries to close one of the screens, it also closes the fd that it used to create the screen with. Reasonable enough. However that fd is the one that's being used by the device. Perhaps it should be dup'ing the fd and doing all the ioctl's on that, so that this sort of thing doesn't happen. Maarten, thoughts?
Created attachment 101339 [details] GDB log with both patches All as previous two passes – device on fd #5 still opened after freeze.
But with reverted (bisected) patch both fd #4,#5 opened and closed successfully.
Mind testing out http://patchwork.freedesktop.org/patch/28219/ ?
Created attachment 101350 [details] GDB log with "dup fd..." patch It works! Before first close: $ ls -l /proc/29981/fd/ total 0 lrwx------ 1 user users 64 Jun 19 13:20 0 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 13:20 1 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 13:20 2 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 13:20 3 -> socket:[1333878] lrwx------ 1 user users 64 Jun 19 13:20 4 -> /dev/dri/card0 lrwx------ 1 user users 64 Jun 19 13:20 5 -> /dev/dri/card0 lrwx------ 1 user users 64 Jun 19 13:20 6 -> /dev/dri/card0 After first close: $ ls -l /proc/29981/fd/ total 0 lrwx------ 1 user users 64 Jun 19 13:20 0 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 13:20 1 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 13:20 2 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 13:20 3 -> socket:[1333878] lrwx------ 1 user users 64 Jun 19 13:20 5 -> /dev/dri/card0 lrwx------ 1 user users 64 Jun 19 13:20 6 -> /dev/dri/card0 After second close: $ ls -l /proc/29981/fd/ total 0 lrwx------ 1 user users 64 Jun 19 13:20 0 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 13:20 1 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 13:20 2 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 13:20 3 -> socket:[1333878] lrwx------ 1 user users 64 Jun 19 13:20 6 -> /dev/dri/card0 After third close: $ ls -l /proc/29981/fd/ total 0 lrwx------ 1 user users 64 Jun 19 13:20 0 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 13:20 1 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 13:20 2 -> /dev/pts/0 lrwx------ 1 user users 64 Jun 19 13:20 3 -> socket:[1333878]
*** Bug 79946 has been marked as a duplicate of this bug. ***
(In reply to comment #31) > Created attachment 101350 [details] > GDB log with "dup fd..." patch > > It works! Great! Also I just figured out why this is happening to you and not to anyone else. It's the dual screens! (Who uses separate screens anyways...) src/glx/glxext.c allocates a screen per... screen. And then when it cleans up, it deletes the screens in the same order it created them. So it's precisely the scenario I outlined, where by the time the second screen goes to be destroyed, the fd it relies on (which is the first one) has already been closed. I'll push this patch out and tag it for 10.2.
A (very slightly) improved version of the patch is now in mesa master. Hopefully it'll get backported to 10.2.2, but if not, it'll be in 10.2.3.
(In reply to comment #34) > A (very slightly) improved version of the patch is now in mesa master. > Hopefully it'll get backported to 10.2.2, but if not, it'll be in 10.2.3. An improved patch http://cgit.freedesktop.org/mesa/mesa/commit/?id=a59f2bb17bcc78e09653391748549e7973990798 was tested and also works for me.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.