Hi, I'm just forwarding to you a bug report originally made against Firefox: https://bugzilla.mozilla.org/show_bug.cgi?id=973192 The symptom is Firefox freezing on startup. The cause (see around comment 39--42 on that bug) is that a XCloseDisplay() call takes about one minute to return. The original reporter of the bug has recorded a profile which shows time spent in nouveau_dri.so (see comments 50, 51 on the mozilla bug) If you want to see what the application code triggering this bug looks like, it's all self-contained in this file: http://hg.mozilla.org/mozilla-central/file/tip/toolkit/xre/glxtest.cpp 228 glXMakeCurrent(dpy, None, nullptr); 229 glXDestroyContext(dpy, context); 230 XDestroyWindow(dpy, window); 231 XFreeColormap(dpy, swa.colormap); 232 XCloseDisplay(dpy); // <--- this XCloseDisplay call takes 1 minute 233 dlclose(libgl); Is there some more information that you need, that isn't already on the mozilla bug? Is this application code doing something evil here? Is this a known issue? Do you have a guess as to whether it might affect many users (i.e., should we try working around it)?
Hi Benoit Jacob Can you attach the output of glxinfo and dmesg after you've come across this issue ? Do you have a link to a build we can reuse ? -Emil
If you can hack out the glxtest into a separate program that would ease all of use alot. Cheers
Created attachment 94465 [details] Stand-alone version of the glxtest program Here you go. build like this: c++ glxtest.cpp -o glxtest -ldl -lX11 run like this: ./glxtest Example output: VENDOR Tungsten Graphics, Inc RENDERER Mesa DRI Intel(R) Sandybridge Mobile VERSION 3.0 Mesa 8.0.4 TFP TRUE
I have also asked the original reporter to share his glxinfo output and details about his setup. You might want to CC yourself on the mozilla bug.
(In reply to comment #4) > I have also asked the original reporter to share his glxinfo output and > details about his setup. You might want to CC yourself on the mozilla bug. Might be easier to get the original reporter onto this one... Can you confirm that the standalone glxtest causes the reported 1-minute stall as well?
Good point, and good question, which I've forwarded on the mozilla bug.
I'm the reporter on mozilla bugzilla tracker. Some technical infos on my computer. Using archlinux, 64 bits. Software versions ? Kernel upgraded less than an hour ago. [fred@fredo-arch ~]$ pacman -Si mesa | grep Version Version : 10.0.3-1 [fred@fredo-arch ~]$ pacman -Si xf86-video-nouveau | grep Version Version : 1.0.10-2 [fred@fredo-arch ~]$ pacman -Si linux | grep Version Version : 3.13.4-1 Hardware ? [fred@fredo-arch ~]$ lspci | grep VGA 02:00.0 VGA compatible controller: NVIDIA Corporation GT218 [GeForce 210] (rev a2) Will add glxinfo log asap.
Created attachment 94479 [details] glxinfo for my hardware.
glxtest output : [fred@fredo-arch ~]$ ./glxtest VENDOR nouveau RENDERER Gallium 0.4 on NVA8 VERSION 3.0 Mesa 10.0.3 TFP TRUE Not a single hang. Will try seeing on full mozilla code if there is still a hang or not.
Emil tried the glxtest program earlier and didn't get a hang either. So at least that's consistent. (I'm adding you to the CC list, hope that's OK, if not, feel free to remove yourself.)
(In reply to comment #10) > Emil tried the glxtest program earlier and didn't get a hang either. So at > least that's consistent. (I'm adding you to the CC list, hope that's OK, if > not, feel free to remove yourself.) I don't mind. I wanted to do so :)
Yeah, sorry about that. The code is exactly the same as in Firefox, and it runs very close to startup, so I'm surprised that this doesn't reproduce.
The work-around that works for us, and that we're considering landing, is to avoid calling XCloseDisplay and, in this case, just calling XSync to handle any pending errors in glxtest. Digging in the Gecko codebase, I found at least another instance where we avoid calling XCloseDisplay because it is "scary": http://hg.mozilla.org/mozilla-central/file/2d5d5fd03050/dom/plugins/ipc/PluginModuleChild.cpp#l569 At the moment I am wondering if we should stop calling XCloseDisplay altogether...
Frederic, would you be able to use top to identify the program using 100% cpu, attach a debugger (gdb attach <pid>) while hanging, and get a stack trace, please?
(In reply to comment #14) > Frederic, would you be able to use top to identify the program using 100% > cpu, attach a debugger (gdb attach <pid>) while hanging, and get a stack > trace, please? Already done in mozilla bugzilla bug report. Adding this log here.
Created attachment 94625 [details] log asked for in comment #14. It is a copy of the log I produced on mozilla bugzilla report.
> Thread 1 (Thread 0x7f2d032c2740 (LWP 3445)): > #0 0x00007f2d02ebfa8d in read () from /usr/lib/libpthread.so.0 > #1 0x00007f2cfd97264a in mozilla::widget::GfxInfo::GetData() () > from /home/fred/Téléchargements/firefox/libxul.so I wouldn't expect a blocking read on a pipe to use 100% CPU. Is this the same process that top is showing using the CPU? If it is, can you use 'H' in top to identify the thread, please?
(In reply to comment #17) > > Thread 1 (Thread 0x7f2d032c2740 (LWP 3445)): > > #0 0x00007f2d02ebfa8d in read () from /usr/lib/libpthread.so.0 > > #1 0x00007f2cfd97264a in mozilla::widget::GfxInfo::GetData() () > > from /home/fred/Téléchargements/firefox/libxul.so > > I wouldn't expect a blocking read on a pipe to use 100% CPU. > Is this the same process that top is showing using the CPU? > If it is, can you use 'H' in top to identify the thread, please? Captured 2 minutes ago. When I launch Firefox : 7136 fred 20 0 309228 17808 8100 R 100,0 0,4 0:23.85 firefox H in top : 7136 fred 20 0 309228 17808 8100 R 99,8 0,4 1:29.11 firefox It is what I get.
A duplicate Mozilla bug has a stack with symbols: https://bugzilla.mozilla.org/show_bug.cgi?id=975512#c1 #0 0x0000003f3a2da007 in sched_yield () from /lib64/libc.so.6 #1 0x00007f1e003bb1d1 in nouveau_fence_wait () from /usr/lib64/dri/nouveau_dri.so #2 0x00007f1e0039d635 in nvc0_screen_destroy () from /usr/lib64/dri/nouveau_dri.so #3 0x00007f1e002c6d9f in dri_destroy_screen_helper () from /usr/lib64/dri/nouveau_dri.so #4 0x00007f1e002c6e25 in dri_destroy_screen () from /usr/lib64/dri/nouveau_dri.so #5 0x00007f1e000087eb in driDestroyScreen () from /usr/lib64/dri/nouveau_dri.so #6 0x000000321f646932 in dri2DestroyScreen () from /lib64/libGL.so.1 #7 0x000000321f61c526 in FreeScreenConfigs.isra.3 () from /lib64/libGL.so.1 #8 0x000000321f61c5a5 in glx_display_free () from /lib64/libGL.so.1 #9 0x000000321f61c6a5 in __glXCloseDisplay () from /lib64/libGL.so.1 #10 0x0000003f3e21fd22 in XCloseDisplay () from /lib64/libX11.so.6 #11 0x00007f1e09933dcf in glxtest() () from /home/sstangl/bin/nightly/libxul.so #12 0x00007f1e09934b36 in fire_glxtest_process() () from /home/sstangl/bin/nightly/libxul.so #13 0x00007f1e0992e655 in XREMain::XRE_mainInit(bool*) () from /home/sstangl/bin/nightly/libxul.so #14 0x00007f1e09932886 in XREMain::XRE_main(int, char**, nsXREAppData const*) () from /home/sstangl/bin/nightly/libxul.so ---Type <return> to continue, or q <return> to quit--- #15 0x00007f1e09932b94 in XRE_main () from /home/sstangl/bin/nightly/libxul.so #16 0x0000000000406b66 in do_main(int, char**, nsIFile*) () #17 0x000000000040383b in main ()
Hi Benoit IMHO the stack trace does not indicate where the problem occurs, just that there is one. - Are you guys using multi-threaded gl ? - Considering that the problem is not available with a standalone glxtest app, do you have a link to a FF build that one can test against ?
- Are you guys using multi-threaded gl ? Not here. We could be if Frederic was 1) using non-default preferences such as layers.acceleration.force-enabled and 2) viewing specific Web pages (e.g. using WebGL). But here we are talking about a startup hang occurring before this could kick into play. Frederic, can you say if you have any force-enabled prefs set in about:config? - Considering that the problem is not available with a standalone glxtest app, do you have a link to a FF build that one can test against ? Sure. You need a Nightly build, but not a current one since we have landed a work-around (avoiding to call XCloseDisplay here). This one, from Feb. 20, should reproduce: http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-02-20-03-02-02-mozilla-central/ E.g. http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-02-20-03-02-02-mozilla-central/firefox-30.0a1.en-US.linux-x86_64.tar.bz2
(In reply to comment #21) [...] > Frederic, can you say if you have any force-enabled prefs set in > about:config? None. All options are set with their default values.
Don't know if it is related, but I'm facing a plugin container hang and nouveau seems to be guilty in some way. Got this thread with gdb : Thread 1 (Thread 0x7f4dd7b37940 (LWP 2961)): #0 0x00007f4dd2294337 in sched_yield () from /usr/lib/libc.so.6 #1 0x00007f4dbff0ee41 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1 #2 0x00007f4dbff9eca5 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1 #3 0x00007f4dbfe019c1 in vl_screen_destroy () from /usr/lib/vdpau/libvdpau_nouveau.so.1 #4 0x00007f4dbff073a1 in vlVdpDeviceDestroy () from /usr/lib/vdpau/libvdpau_nouveau.so.1 #5 0x00007f4dc4c252a8 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so #6 0x00007f4dc48871b5 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so #7 0x00007f4dc4888000 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so #8 0x00007f4dc4e167c8 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so #9 0x00007f4dc4e1cf7d in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so #10 0x00007f4dc4e1ea6b in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so Cannot access memory at address 0x7ffff106a948 Or is it a flash plugin bug ?
(In reply to comment #23) > Don't know if it is related, but I'm facing a plugin container hang and > nouveau seems to be guilty in some way. > > Got this thread with gdb : > > Thread 1 (Thread 0x7f4dd7b37940 (LWP 2961)): > #0 0x00007f4dd2294337 in sched_yield () from /usr/lib/libc.so.6 > #1 0x00007f4dbff0ee41 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1 > #2 0x00007f4dbff9eca5 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1 > #3 0x00007f4dbfe019c1 in vl_screen_destroy () > from /usr/lib/vdpau/libvdpau_nouveau.so.1 > #4 0x00007f4dbff073a1 in vlVdpDeviceDestroy () > from /usr/lib/vdpau/libvdpau_nouveau.so.1 > #5 0x00007f4dc4c252a8 in ?? () from > /usr/lib/mozilla/plugins/libflashplayer.so > #6 0x00007f4dc48871b5 in ?? () from > /usr/lib/mozilla/plugins/libflashplayer.so > #7 0x00007f4dc4888000 in ?? () from > /usr/lib/mozilla/plugins/libflashplayer.so > #8 0x00007f4dc4e167c8 in ?? () from > /usr/lib/mozilla/plugins/libflashplayer.so > #9 0x00007f4dc4e1cf7d in ?? () from > /usr/lib/mozilla/plugins/libflashplayer.so > #10 0x00007f4dc4e1ea6b in ?? () from > /usr/lib/mozilla/plugins/libflashplayer.so > Cannot access memory at address 0x7ffff106a948 > > Or is it a flash plugin bug ? Seems to be stuck at the same spot with identical call stack, minus the state tracker bits. The memory access message looks interesting, shame there is no indication of what nouveau was doing to provoke this. (gdb) bt full, info locals or disassemble should be able to catch what is going on here. How often do you get this ? Please attach the gdb output of the above three as the problem occurs.
(In reply to comment #21) [snip] > Sure. You need a Nightly build, but not a current one since we have landed a > work-around (avoiding to call XCloseDisplay here). This one, from Feb. 20, > should reproduce: > > http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-02-20-03-02-02- > mozilla-central/ > > E.g. > http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-02-20-03-02-02- > mozilla-central/firefox-30.0a1.en-US.linux-x86_64.tar.bz2 I can reproduce the problem with the linked build, although the creation of 16 threads at startup will make things "a little" bit slow to resolve.
(In reply to comment #24) [...] > > Seems to be stuck at the same spot with identical call stack, minus the > state tracker bits. > > The memory access message looks interesting, shame there is no indication of > what nouveau was doing to provoke this. (gdb) bt full, info locals or > disassemble should be able to catch what is going on here. > > How often do you get this ? Please attach the gdb output of the above three > as the problem occurs. It happens when I try to connect to my youtube account. And after 30 seconds or so, plugin-container works as expected. Here is bt full log : (gdb) bt full #0 0x00007fb478a49337 in sched_yield () from /usr/lib/libc.so.6 No symbol table info available. #1 0x00007fb46670ee41 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1 No symbol table info available. #2 0x00007fb46679eca5 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1 No symbol table info available. #3 0x00007fb4666019c1 in vl_screen_destroy () from /usr/lib/vdpau/libvdpau_nouveau.so.1 No symbol table info available. #4 0x00007fb4667073a1 in vlVdpDeviceDestroy () from /usr/lib/vdpau/libvdpau_nouveau.so.1 No symbol table info available. #5 0x00007fb46b4252a8 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #6 0x00007fb46b0871b5 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #7 0x00007fb46b088000 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #8 0x00007fb46b6167c8 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #9 0x00007fb46b61cf7d in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #10 0x00007fb46b61ea6b in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. Cannot access memory at address 0x7fffc4f85b78 Some technical infos : nouveau 1.0.10, mesa 10.0.3 All this on an archlinux 64 bits, up-to-date.
(In reply to comment #26) [...] > It happens when I try to connect to my youtube account. And after 30 seconds > or so, plugin-container works as expected. > > Here is bt full log : > > (gdb) bt full > #0 0x00007fb478a49337 in sched_yield () from /usr/lib/libc.so.6 > No symbol table info available. > #1 0x00007fb46670ee41 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1 > No symbol table info available. > #2 0x00007fb46679eca5 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1 > No symbol table info available. > #3 0x00007fb4666019c1 in vl_screen_destroy () > from /usr/lib/vdpau/libvdpau_nouveau.so.1 > No symbol table info available. > #4 0x00007fb4667073a1 in vlVdpDeviceDestroy () > from /usr/lib/vdpau/libvdpau_nouveau.so.1 > No symbol table info available. > #5 0x00007fb46b4252a8 in ?? () from > /usr/lib/mozilla/plugins/libflashplayer.so > No symbol table info available. > #6 0x00007fb46b0871b5 in ?? () from > /usr/lib/mozilla/plugins/libflashplayer.so > No symbol table info available. > #7 0x00007fb46b088000 in ?? () from > /usr/lib/mozilla/plugins/libflashplayer.so > No symbol table info available. > #8 0x00007fb46b6167c8 in ?? () from > /usr/lib/mozilla/plugins/libflashplayer.so > No symbol table info available. > #9 0x00007fb46b61cf7d in ?? () from > /usr/lib/mozilla/plugins/libflashplayer.so > No symbol table info available. > #10 0x00007fb46b61ea6b in ?? () from > /usr/lib/mozilla/plugins/libflashplayer.so > No symbol table info available. > Cannot access memory at address 0x7fffc4f85b78 > I was afraid that your mesa build does not have debug symbols. Would you mind rebuilding and installing the debug symbols ? * Can you grab the PKGBUILD [1] * Add "options=('debug')" around the source line * rebuild (makepkg) and install (do not forget the -debug* packages) > Some technical infos : nouveau 1.0.10, mesa 10.0.3 > > All this on an archlinux 64 bits, up-to-date. FWIW I cannot reproduce the crash, although sched_yield is still around Cheers [1] https://projects.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/mesa
Some progress with debug packages. But I think I'll need xf86-video-nouveau debug package too. At least, here is the log : (gdb) bt full #0 0x00007f364503a337 in sched_yield () from /usr/lib/libc.so.6 No symbol table info available. #1 0x00007f3632d0e411 in nouveau_fence_wait (fence=0x7f3630760550) at nouveau_fence.c:212 screen = 0x7f3639e76c00 spins = 329500464 #2 0x00007f3632d9e115 in nv50_screen_destroy (pscreen=0x7f3639e76c00) at nv50/nv50_screen.c:286 No locals. #3 0x00007f3632c019c1 in vl_screen_destroy (vscreen=0x7f3639e5c980) at ../../../../src/gallium/auxiliary/vl/vl_winsys_dri.c:414 scrn = 0x7f3639e5c980 #4 0x00007f3632d069c1 in vlVdpDeviceDestroy (device=1) at device.c:174 dev = 0x7f3639eaf2d0 #5 0x00007f3637a252a8 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #6 0x00007f36376871b5 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #7 0x00007f3637688000 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #8 0x00007f3637c167c8 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #9 0x00007f3637c1cf7d in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #10 0x00007f3637c1ea6b in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #11 0x00007f3637c1ed8a in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #12 0x00007f3637880601 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #13 0x00007f36379e5d31 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #14 0x00007f36379e60b5 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #15 0x00007f36379e4efd in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #16 0x00007f36379edae2 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so No symbol table info available. #17 0x00007f364751c347 in ?? () from /home/fred/Applications/firefox/libxul.so No symbol table info available. #18 0x00007f3646d94720 in ?? () from /home/fred/Applications/firefox/libxul.so No symbol table info available. #19 0x00007f3646d955bd in ?? () from /home/fred/Applications/firefox/libxul.so No symbol table info available. #20 0x00007f3646caff3c in ?? () from /home/fred/Applications/firefox/libxul.so No symbol table info available. #21 0x00007f3646cb12da in ?? () from /home/fred/Applications/firefox/libxul.so No symbol table info available. Cannot access memory at address 0x7fff90a93308
Looks like a NouVeau is triggered by Mozilla Firefox trunk code. I opened a bug on mozilla bugtracker, https://bugzilla.mozilla.org/show_bug.cgi?id=978439 Even if I find a "dirty workaround", valgrind logs are showing a NouVeau driver bug. I will attach both Mozilla Firefox valgrind log and glxtest one. But there is a common part : Invalid read of size 4 at 0x8DBA0BE: nouveau_fence_wait (nouveau_fence.c:205) by 0x8E4AFF4: nv50_screen_destroy (nv50_screen.c:293) by 0x8DAACDE: dri_destroy_screen_helper (dri_screen.c:378) by 0x8DAAD74: dri_destroy_screen (dri_screen.c:388) by 0x8AB13D6: driDestroyScreen (dri_util.c:234) by 0x69A06C1: dri2DestroyScreen (dri2_glx.c:712) by 0x697CFE5: FreeScreenConfigs.isra.3 (glxext.c:208) by 0x697D064: glx_display_free (glxext.c:231) by 0x697D164: __glXCloseDisplay (glxext.c:272) by 0x50569F1: XCloseDisplay (in /usr/lib/libX11.so.6.3.0)
Created attachment 95188 [details] Valgrind log for glxtest Here is the valgrind log for Glx test
Created attachment 95189 [details] Valgrind log with debug symbols for Mozilla Firefox trunk code. And log for mozilla Firefox trunk code. Always the same beginning : Invalid read of size 4 at 0x161590BE: nouveau_fence_wait (nouveau_fence.c:205) by 0x161E9FF4: nv50_screen_destroy (nv50_screen.c:293) by 0x16149CDE: dri_destroy_screen_helper (dri_screen.c:378) by 0x16149D74: dri_destroy_screen (dri_screen.c:388) by 0x15E503D6: driDestroyScreen (dri_util.c:234) by 0x1256D6C1: dri2DestroyScreen (dri2_glx.c:712) by 0x12549FE5: FreeScreenConfigs.isra.3 (glxext.c:208) by 0x1254A064: glx_display_free (glxext.c:231) by 0x1254A164: __glXCloseDisplay (glxext.c:272) by 0xEF009F1: XCloseDisplay (in /usr/lib/libX11.so.6.3.0) Any idea ?
...note that the log attached by Frederic in comment 30 also has a stack to where this pointer was free'd --- we have a use-after-free bug here! This was obtained by Frederic by running the glxtest program attached in comment 2 here, in valgrind: valgrind --track-origins=yes ./glxtest I hope that Nouveau developers here can reproduce this easily!
The stack to the free() points to line 203 here, while the stack to where the free'd data is subsequently used points to line 205 here: http://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nouveau_fence.c?id=ce6dd69697ae62d9336bbd4f5808bc4d75cdcc04#n203 if (fence == screen->fence.current) nouveau_fence_next(screen); do { nouveau_fence_update(screen, FALSE); // <--- free here! if (fence->state == NOUVEAU_FENCE_STATE_SIGNALLED) // <-- use-after-free return TRUE; So it seems like nouveau_fence_update (which was apparently inlined) destroys the fence object... do you need to call nouveau_fence_ref() to keep it alive?
Also, here's the story of how that caused Firefox to keep spinning here. The mozilla change that made this bug noticeable (by having Firefox stuck for a minute there) was https://bugzilla.mozilla.org/show_bug.cgi?id=860254 . Before that, this use-after-free read was being lucky enough to read the data that was still present in memory even after the free(), so fence->state still had the expected value; but after that mozilla-side change, our memory allocator started immediately poisoning free'd memory, so fence->state had a totally wrong value, causing this loop here to spin indefinitely.
(In reply to comment #33) > The stack to the free() points to line 203 here, while the stack to where > the free'd data is subsequently used points to line 205 here: > > http://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/ > nouveau_fence.c?id=ce6dd69697ae62d9336bbd4f5808bc4d75cdcc04#n203 > > > if (fence == screen->fence.current) > nouveau_fence_next(screen); > > do { > nouveau_fence_update(screen, FALSE); // <--- free here! > > if (fence->state == NOUVEAU_FENCE_STATE_SIGNALLED) // <-- > use-after-free > return TRUE; > > > So it seems like nouveau_fence_update (which was apparently inlined) > destroys the fence object... do you need to call nouveau_fence_ref() to keep > it alive? This code is rather confusing. You have to keep in mind how it's used, which among other things is from the kick handler. I tried to fix it up with http://cgit.freedesktop.org/mesa/mesa/commit/?id=ce6dd69697ae62d9336bbd4f5808bc4d75cdcc04 But I guess it was insufficient? I have an odd recollection that I felt like the stuff in the context destroy was suspect, but I don't remember how. Since it wasn't directly related to my problem, I left it alone (esp since I was going under the assumption that it would only be triggered on exit, in which case it's harder to care). Unfortunately it was a long enough time ago that I've lost my context on this. I'm guessing that the key here is that there are multiple contexts and one screen. (Someone should confirm that to be the case.)
Can you reproduce yourself by running the attached glxtest program in valgrind?
(In reply to comment #36) > Can you reproduce yourself by running the attached glxtest program in > valgrind? Oh, I'm sure (can't right now, will try tonight). But I doubt it's the source of the bug. The glxtest program doesn't appear to hang. Memory is unlikely to be reallocated _so_ quickly, that ->state would change. (Although more likely in a multithreaded program... hm.)
I wasn't clear enough in comment 34, let me explain better :-) The Mozilla change that exposed this, https://bugzilla.mozilla.org/show_bug.cgi?id=860254, is exactly about having memory overwritten immediately on free(). So this is _exactly_ what is happening here :-) The reason why the stand-alone glxtest program doesn't hang is that it doesn't use Mozilla's modified memory allocator that overwrites memory on free(). If you ran it with this memory allocator, no doubt that it would hang there, as this loop would keep reading a wrong fence->status. So what you can hope to reproduce yourself with this glxtest program is not the hang, it's the valgrind-detected use-after-free, and I really believe that that is the root cause for the hang when running with Mozilla's memory allocator.
(In reply to comment #38) > I wasn't clear enough in comment 34, let me explain better :-) > > The Mozilla change that exposed this, > https://bugzilla.mozilla.org/show_bug.cgi?id=860254, is exactly about having > memory overwritten immediately on free(). So this is _exactly_ what is > happening here :-) Ah, I missed that little bit, sorry! You're almost certainly right then.
OK, so I'm obviously able to reproduce the use-after-free. Like I said, this logic is fairly subtle. I think the problem is that there's an implicit assumption that the fence has an external ref before someone waits on it. In the case of the current screen fence, that's not necessarily the case. I think the proper fix is to add a fence ref in the destructor around the wait. I'm going to think a bit more on this.
Created attachment 95205 [details] [review] patch to fix fence wait on screen destroy This fixes the valgrind warning for me, and makes logical sense (always a plus).
Frederic, Benoit, Thank you very much for the enormous help with triaging this bug. A few quick tests on my system indicate that the problem is resolved by the linked patch. AFAICS the patch is only in the master branch as it just missed the 10.0.4 and 10.1.0 release. The commit is annotated and will be included in the next mesa bug fix releases. Feel free to reopen the report if the bug is still around with the patch from commit 41.
Bug is fixed for good. Archlinux upgraded to Mesa 10.1.1 today, and after rebuilding firefox, no more freeze problem with nouveau. Thanks for the fix.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.