Bug 75279 - XCloseDisplay() takes one minute around nouveau_dri.so, freezing Firefox startup
XCloseDisplay() takes one minute around nouveau_dri.so, freezing Firefox startup
Status: RESOLVED FIXED
Product: xorg
Classification: Unclassified
Component: Driver/nouveau
unspecified
x86-64 (AMD64) Linux (All)
: medium normal
Assigned To: Nouveau Project
Xorg Project Team
https://bugzilla.mozilla.org/show_bug...
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2014-02-20 20:45 UTC by Benoit Jacob
Modified: 2014-04-19 13:04 UTC (History)
2 users (show)

See Also:


Attachments
Stand-alone version of the glxtest program (6.96 KB, text/plain)
2014-02-21 00:14 UTC, Benoit Jacob
no flags Details
glxinfo for my hardware. (47.41 KB, text/plain)
2014-02-21 07:36 UTC, Frederic Bezies
no flags Details
log asked for in comment #14. (22.34 KB, text/plain)
2014-02-23 22:16 UTC, Frederic Bezies
no flags Details
Valgrind log for glxtest (2.65 KB, text/plain)
2014-03-05 20:10 UTC, Frederic Bezies
no flags Details
Valgrind log with debug symbols for Mozilla Firefox trunk code. (3.07 KB, text/plain)
2014-03-05 20:12 UTC, Frederic Bezies
no flags Details
patch to fix fence wait on screen destroy (3.63 KB, patch)
2014-03-06 03:30 UTC, Ilia Mirkin
no flags Details | Splinter Review

Note You need to log in before you can comment on or make changes to this bug.
Description Benoit Jacob 2014-02-20 20:45:32 UTC
Hi, I'm just forwarding to you a bug report originally made against Firefox:
https://bugzilla.mozilla.org/show_bug.cgi?id=973192

The symptom is Firefox freezing on startup. The cause (see around comment 39--42 on that bug) is that a XCloseDisplay() call takes about one minute to return.

The original reporter of the bug has recorded a profile which shows time spent in nouveau_dri.so (see comments 50, 51 on the mozilla bug)

If you want to see what the application code triggering this bug looks like, it's all self-contained in this file:

http://hg.mozilla.org/mozilla-central/file/tip/toolkit/xre/glxtest.cpp

   228   glXMakeCurrent(dpy, None, nullptr);
   229   glXDestroyContext(dpy, context);
   230   XDestroyWindow(dpy, window);
   231   XFreeColormap(dpy, swa.colormap);
   232   XCloseDisplay(dpy);  // <--- this XCloseDisplay call takes 1 minute
   233   dlclose(libgl);

Is there some more information that you need, that isn't already on the mozilla bug?

Is this application code doing something evil here?

Is this a known issue? Do you have a guess as to whether it might affect many users (i.e., should we try working around it)?
Comment 1 Emil Velikov 2014-02-20 23:40:10 UTC
Hi Benoit Jacob

Can you attach the output of glxinfo and dmesg after you've come across this issue ? Do you have a link to a build we can reuse ?

-Emil
Comment 2 Emil Velikov 2014-02-20 23:48:59 UTC
If you can hack out the glxtest into a separate program that would ease all of use alot. Cheers
Comment 3 Benoit Jacob 2014-02-21 00:14:50 UTC
Created attachment 94465 [details]
Stand-alone version of the glxtest program

Here you go.

build like this:
   c++ glxtest.cpp -o glxtest -ldl -lX11
run like this:
   ./glxtest
Example output:
  VENDOR
  Tungsten Graphics, Inc
  RENDERER
  Mesa DRI Intel(R) Sandybridge Mobile
  VERSION
  3.0 Mesa 8.0.4
  TFP
  TRUE
Comment 4 Benoit Jacob 2014-02-21 00:15:54 UTC
I have also asked the original reporter to share his glxinfo output and details about his setup. You might want to CC yourself on the mozilla bug.
Comment 5 Ilia Mirkin 2014-02-21 00:20:19 UTC
(In reply to comment #4)
> I have also asked the original reporter to share his glxinfo output and
> details about his setup. You might want to CC yourself on the mozilla bug.

Might be easier to get the original reporter onto this one...

Can you confirm that the standalone glxtest causes the reported 1-minute stall as well?
Comment 6 Benoit Jacob 2014-02-21 00:26:47 UTC
Good point, and good question, which I've forwarded on the mozilla bug.
Comment 7 Frederic Bezies 2014-02-21 07:35:26 UTC
I'm the reporter on mozilla bugzilla tracker.

Some technical infos on my computer.

Using archlinux, 64 bits.

Software versions ? Kernel upgraded less than an hour ago.

[fred@fredo-arch ~]$ pacman -Si mesa | grep Version
Version               : 10.0.3-1
[fred@fredo-arch ~]$ pacman -Si xf86-video-nouveau | grep Version
Version               : 1.0.10-2
[fred@fredo-arch ~]$ pacman -Si linux | grep Version
Version               : 3.13.4-1

Hardware ?
[fred@fredo-arch ~]$ lspci | grep VGA
02:00.0 VGA compatible controller: NVIDIA Corporation GT218 [GeForce 210] (rev a2)

Will add glxinfo log asap.
Comment 8 Frederic Bezies 2014-02-21 07:36:33 UTC
Created attachment 94479 [details]
glxinfo for my hardware.
Comment 9 Frederic Bezies 2014-02-21 07:40:39 UTC
glxtest output :

[fred@fredo-arch ~]$ ./glxtest 
VENDOR
nouveau
RENDERER
Gallium 0.4 on NVA8
VERSION
3.0 Mesa 10.0.3
TFP
TRUE

Not a single hang. Will try seeing on full mozilla code if there is still a hang or not.
Comment 10 Ilia Mirkin 2014-02-21 07:45:36 UTC
Emil tried the glxtest program earlier and didn't get a hang either. So at least that's consistent. (I'm adding you to the CC list, hope that's OK, if not, feel free to remove yourself.)
Comment 11 Frederic Bezies 2014-02-21 09:58:19 UTC
(In reply to comment #10)
> Emil tried the glxtest program earlier and didn't get a hang either. So at
> least that's consistent. (I'm adding you to the CC list, hope that's OK, if
> not, feel free to remove yourself.)

I don't mind. I wanted to do so :)
Comment 12 Benoit Jacob 2014-02-21 13:44:10 UTC
Yeah, sorry about that. The code is exactly the same as in Firefox, and it runs very close to startup, so I'm surprised that this doesn't reproduce.
Comment 13 Benoit Jacob 2014-02-22 20:34:51 UTC
The work-around that works for us, and that we're considering landing, is to avoid calling XCloseDisplay and, in this case, just calling XSync to handle any pending errors in glxtest. Digging in the Gecko codebase, I found at least another instance where we avoid calling XCloseDisplay because it is "scary":

http://hg.mozilla.org/mozilla-central/file/2d5d5fd03050/dom/plugins/ipc/PluginModuleChild.cpp#l569

At the moment I am wondering if we should stop calling XCloseDisplay altogether...
Comment 14 Karl Tomlinson 2014-02-23 22:12:38 UTC
Frederic, would you be able to use top to identify the program using 100% cpu, attach a debugger (gdb attach <pid>) while hanging, and get a stack trace, please?
Comment 15 Frederic Bezies 2014-02-23 22:15:21 UTC
(In reply to comment #14)
> Frederic, would you be able to use top to identify the program using 100%
> cpu, attach a debugger (gdb attach <pid>) while hanging, and get a stack
> trace, please?

Already done in mozilla bugzilla bug report.

Adding this log here.
Comment 16 Frederic Bezies 2014-02-23 22:16:18 UTC
Created attachment 94625 [details]
log asked for in comment #14.

It is a copy of the log I produced on mozilla bugzilla report.
Comment 17 Karl Tomlinson 2014-02-23 22:25:45 UTC
> Thread 1 (Thread 0x7f2d032c2740 (LWP 3445)):
> #0  0x00007f2d02ebfa8d in read () from /usr/lib/libpthread.so.0
> #1  0x00007f2cfd97264a in mozilla::widget::GfxInfo::GetData() ()
>    from /home/fred/Téléchargements/firefox/libxul.so

I wouldn't expect a blocking read on a pipe to use 100% CPU.
Is this the same process that top is showing using the CPU?
If it is, can you use 'H' in top to identify the thread, please?
Comment 18 Frederic Bezies 2014-02-24 08:19:14 UTC
(In reply to comment #17)
> > Thread 1 (Thread 0x7f2d032c2740 (LWP 3445)):
> > #0  0x00007f2d02ebfa8d in read () from /usr/lib/libpthread.so.0
> > #1  0x00007f2cfd97264a in mozilla::widget::GfxInfo::GetData() ()
> >    from /home/fred/Téléchargements/firefox/libxul.so
> 
> I wouldn't expect a blocking read on a pipe to use 100% CPU.
> Is this the same process that top is showing using the CPU?
> If it is, can you use 'H' in top to identify the thread, please?

Captured 2 minutes ago. When I launch Firefox :

7136 fred      20   0  309228  17808   8100 R 100,0  0,4   0:23.85 firefox

H in top :

7136 fred      20   0  309228  17808   8100 R 99,8  0,4   1:29.11 firefox 

It is what I get.
Comment 19 Benoit Jacob 2014-02-25 14:22:25 UTC
A duplicate Mozilla bug has a stack with symbols:

https://bugzilla.mozilla.org/show_bug.cgi?id=975512#c1

#0  0x0000003f3a2da007 in sched_yield () from /lib64/libc.so.6
#1  0x00007f1e003bb1d1 in nouveau_fence_wait ()
   from /usr/lib64/dri/nouveau_dri.so
#2  0x00007f1e0039d635 in nvc0_screen_destroy ()
   from /usr/lib64/dri/nouveau_dri.so
#3  0x00007f1e002c6d9f in dri_destroy_screen_helper ()
   from /usr/lib64/dri/nouveau_dri.so
#4  0x00007f1e002c6e25 in dri_destroy_screen ()
   from /usr/lib64/dri/nouveau_dri.so
#5  0x00007f1e000087eb in driDestroyScreen ()
   from /usr/lib64/dri/nouveau_dri.so
#6  0x000000321f646932 in dri2DestroyScreen () from /lib64/libGL.so.1
#7  0x000000321f61c526 in FreeScreenConfigs.isra.3 () from /lib64/libGL.so.1
#8  0x000000321f61c5a5 in glx_display_free () from /lib64/libGL.so.1
#9  0x000000321f61c6a5 in __glXCloseDisplay () from /lib64/libGL.so.1
#10 0x0000003f3e21fd22 in XCloseDisplay () from /lib64/libX11.so.6
#11 0x00007f1e09933dcf in glxtest() () from /home/sstangl/bin/nightly/libxul.so
#12 0x00007f1e09934b36 in fire_glxtest_process() ()
   from /home/sstangl/bin/nightly/libxul.so
#13 0x00007f1e0992e655 in XREMain::XRE_mainInit(bool*) ()
   from /home/sstangl/bin/nightly/libxul.so
#14 0x00007f1e09932886 in XREMain::XRE_main(int, char**, nsXREAppData const*)
    () from /home/sstangl/bin/nightly/libxul.so
---Type <return> to continue, or q <return> to quit---
#15 0x00007f1e09932b94 in XRE_main () from /home/sstangl/bin/nightly/libxul.so
#16 0x0000000000406b66 in do_main(int, char**, nsIFile*) ()
#17 0x000000000040383b in main ()
Comment 20 Emil Velikov 2014-02-25 16:39:08 UTC
Hi Benoit

IMHO the stack trace does not indicate where the problem occurs, just that there is one.
 - Are you guys using multi-threaded gl ?
 - Considering that the problem is not available with a standalone glxtest app, do you have a link to a FF build that one can test against ?
Comment 21 Benoit Jacob 2014-02-25 17:17:07 UTC
 - Are you guys using multi-threaded gl ?

Not here. We could be if Frederic was 1) using non-default preferences such as layers.acceleration.force-enabled and 2) viewing specific Web pages (e.g. using WebGL). But here we are talking about a startup hang occurring before this could kick into play.

Frederic, can you say if you have any force-enabled prefs set in about:config?

 - Considering that the problem is not available with a standalone glxtest app, do you have a link to a FF build that one can test against ?

Sure. You need a Nightly build, but not a current one since we have landed a work-around (avoiding to call XCloseDisplay here). This one, from Feb. 20, should reproduce:

http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-02-20-03-02-02-mozilla-central/

E.g. http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-02-20-03-02-02-mozilla-central/firefox-30.0a1.en-US.linux-x86_64.tar.bz2
Comment 22 Frederic Bezies 2014-02-25 18:02:14 UTC
(In reply to comment #21)
[...] 
> Frederic, can you say if you have any force-enabled prefs set in
> about:config?

None. All options are set with their default values.
Comment 23 Frederic Bezies 2014-03-01 10:18:59 UTC
Don't know if it is related, but I'm facing a plugin container hang and nouveau seems to be guilty in some way.

Got this thread with gdb :

Thread 1 (Thread 0x7f4dd7b37940 (LWP 2961)):
#0  0x00007f4dd2294337 in sched_yield () from /usr/lib/libc.so.6
#1  0x00007f4dbff0ee41 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1
#2  0x00007f4dbff9eca5 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1
#3  0x00007f4dbfe019c1 in vl_screen_destroy ()
   from /usr/lib/vdpau/libvdpau_nouveau.so.1
#4  0x00007f4dbff073a1 in vlVdpDeviceDestroy ()
   from /usr/lib/vdpau/libvdpau_nouveau.so.1
#5  0x00007f4dc4c252a8 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
#6  0x00007f4dc48871b5 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
#7  0x00007f4dc4888000 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
#8  0x00007f4dc4e167c8 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
#9  0x00007f4dc4e1cf7d in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
#10 0x00007f4dc4e1ea6b in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
Cannot access memory at address 0x7ffff106a948

Or is it a flash plugin bug ?
Comment 24 Emil Velikov 2014-03-01 12:17:49 UTC
(In reply to comment #23)
> Don't know if it is related, but I'm facing a plugin container hang and
> nouveau seems to be guilty in some way.
> 
> Got this thread with gdb :
> 
> Thread 1 (Thread 0x7f4dd7b37940 (LWP 2961)):
> #0  0x00007f4dd2294337 in sched_yield () from /usr/lib/libc.so.6
> #1  0x00007f4dbff0ee41 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1
> #2  0x00007f4dbff9eca5 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1
> #3  0x00007f4dbfe019c1 in vl_screen_destroy ()
>    from /usr/lib/vdpau/libvdpau_nouveau.so.1
> #4  0x00007f4dbff073a1 in vlVdpDeviceDestroy ()
>    from /usr/lib/vdpau/libvdpau_nouveau.so.1
> #5  0x00007f4dc4c252a8 in ?? () from
> /usr/lib/mozilla/plugins/libflashplayer.so
> #6  0x00007f4dc48871b5 in ?? () from
> /usr/lib/mozilla/plugins/libflashplayer.so
> #7  0x00007f4dc4888000 in ?? () from
> /usr/lib/mozilla/plugins/libflashplayer.so
> #8  0x00007f4dc4e167c8 in ?? () from
> /usr/lib/mozilla/plugins/libflashplayer.so
> #9  0x00007f4dc4e1cf7d in ?? () from
> /usr/lib/mozilla/plugins/libflashplayer.so
> #10 0x00007f4dc4e1ea6b in ?? () from
> /usr/lib/mozilla/plugins/libflashplayer.so
> Cannot access memory at address 0x7ffff106a948
> 
> Or is it a flash plugin bug ?

Seems to be stuck at the same spot with identical call stack, minus the state tracker bits.

The memory access message looks interesting, shame there is no indication of what nouveau was doing to provoke this. (gdb) bt full, info locals or disassemble should be able to catch what is going on here.

How often do you get this ? Please attach the gdb output of the above three as the problem occurs.
Comment 25 Emil Velikov 2014-03-01 12:20:59 UTC
(In reply to comment #21)
[snip]
> Sure. You need a Nightly build, but not a current one since we have landed a
> work-around (avoiding to call XCloseDisplay here). This one, from Feb. 20,
> should reproduce:
> 
> http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-02-20-03-02-02-
> mozilla-central/
> 
> E.g.
> http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/2014-02-20-03-02-02-
> mozilla-central/firefox-30.0a1.en-US.linux-x86_64.tar.bz2

I can reproduce the problem with the linked build, although the creation of 16 threads at startup will make things "a little" bit slow to resolve.
Comment 26 Frederic Bezies 2014-03-01 13:53:09 UTC
(In reply to comment #24)
[...]
> 
> Seems to be stuck at the same spot with identical call stack, minus the
> state tracker bits.
> 
> The memory access message looks interesting, shame there is no indication of
> what nouveau was doing to provoke this. (gdb) bt full, info locals or
> disassemble should be able to catch what is going on here.
> 
> How often do you get this ? Please attach the gdb output of the above three
> as the problem occurs.

It happens when I try to connect to my youtube account. And after 30 seconds or so, plugin-container works as expected.

Here is bt full log :

(gdb) bt full
#0  0x00007fb478a49337 in sched_yield () from /usr/lib/libc.so.6
No symbol table info available.
#1  0x00007fb46670ee41 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1
No symbol table info available.
#2  0x00007fb46679eca5 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1
No symbol table info available.
#3  0x00007fb4666019c1 in vl_screen_destroy ()
   from /usr/lib/vdpau/libvdpau_nouveau.so.1
No symbol table info available.
#4  0x00007fb4667073a1 in vlVdpDeviceDestroy ()
   from /usr/lib/vdpau/libvdpau_nouveau.so.1
No symbol table info available.
#5  0x00007fb46b4252a8 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#6  0x00007fb46b0871b5 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#7  0x00007fb46b088000 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#8  0x00007fb46b6167c8 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#9  0x00007fb46b61cf7d in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#10 0x00007fb46b61ea6b in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
Cannot access memory at address 0x7fffc4f85b78

Some technical infos : nouveau 1.0.10, mesa 10.0.3

All this on an archlinux 64 bits, up-to-date.
Comment 27 Emil Velikov 2014-03-01 14:38:49 UTC
(In reply to comment #26)
[...]
> It happens when I try to connect to my youtube account. And after 30 seconds
> or so, plugin-container works as expected.
> 
> Here is bt full log :
> 
> (gdb) bt full
> #0  0x00007fb478a49337 in sched_yield () from /usr/lib/libc.so.6
> No symbol table info available.
> #1  0x00007fb46670ee41 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1
> No symbol table info available.
> #2  0x00007fb46679eca5 in ?? () from /usr/lib/vdpau/libvdpau_nouveau.so.1
> No symbol table info available.
> #3  0x00007fb4666019c1 in vl_screen_destroy ()
>    from /usr/lib/vdpau/libvdpau_nouveau.so.1
> No symbol table info available.
> #4  0x00007fb4667073a1 in vlVdpDeviceDestroy ()
>    from /usr/lib/vdpau/libvdpau_nouveau.so.1
> No symbol table info available.
> #5  0x00007fb46b4252a8 in ?? () from
> /usr/lib/mozilla/plugins/libflashplayer.so
> No symbol table info available.
> #6  0x00007fb46b0871b5 in ?? () from
> /usr/lib/mozilla/plugins/libflashplayer.so
> No symbol table info available.
> #7  0x00007fb46b088000 in ?? () from
> /usr/lib/mozilla/plugins/libflashplayer.so
> No symbol table info available.
> #8  0x00007fb46b6167c8 in ?? () from
> /usr/lib/mozilla/plugins/libflashplayer.so
> No symbol table info available.
> #9  0x00007fb46b61cf7d in ?? () from
> /usr/lib/mozilla/plugins/libflashplayer.so
> No symbol table info available.
> #10 0x00007fb46b61ea6b in ?? () from
> /usr/lib/mozilla/plugins/libflashplayer.so
> No symbol table info available.
> Cannot access memory at address 0x7fffc4f85b78
> 
I was afraid that your mesa build does not have debug symbols. Would you mind rebuilding and installing the debug symbols ?

* Can you grab the PKGBUILD [1]
* Add "options=('debug')" around the source line
* rebuild (makepkg) and install (do not forget the -debug* packages)

> Some technical infos : nouveau 1.0.10, mesa 10.0.3
> 
> All this on an archlinux 64 bits, up-to-date.

FWIW I cannot reproduce the crash, although sched_yield is still around

Cheers

[1] https://projects.archlinux.org/svntogit/packages.git/tree/trunk/PKGBUILD?h=packages/mesa
Comment 28 Frederic Bezies 2014-03-01 15:26:09 UTC
Some progress with debug packages. But I think I'll need xf86-video-nouveau debug package too.

At least, here is the log :

(gdb) bt full
#0  0x00007f364503a337 in sched_yield () from /usr/lib/libc.so.6
No symbol table info available.
#1  0x00007f3632d0e411 in nouveau_fence_wait (fence=0x7f3630760550)
    at nouveau_fence.c:212
        screen = 0x7f3639e76c00
        spins = 329500464
#2  0x00007f3632d9e115 in nv50_screen_destroy (pscreen=0x7f3639e76c00)
    at nv50/nv50_screen.c:286
No locals.
#3  0x00007f3632c019c1 in vl_screen_destroy (vscreen=0x7f3639e5c980)
    at ../../../../src/gallium/auxiliary/vl/vl_winsys_dri.c:414
        scrn = 0x7f3639e5c980
#4  0x00007f3632d069c1 in vlVdpDeviceDestroy (device=1) at device.c:174
        dev = 0x7f3639eaf2d0
#5  0x00007f3637a252a8 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#6  0x00007f36376871b5 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#7  0x00007f3637688000 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#8  0x00007f3637c167c8 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#9  0x00007f3637c1cf7d in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#10 0x00007f3637c1ea6b in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#11 0x00007f3637c1ed8a in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#12 0x00007f3637880601 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#13 0x00007f36379e5d31 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#14 0x00007f36379e60b5 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#15 0x00007f36379e4efd in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#16 0x00007f36379edae2 in ?? () from /usr/lib/mozilla/plugins/libflashplayer.so
No symbol table info available.
#17 0x00007f364751c347 in ?? () from /home/fred/Applications/firefox/libxul.so
No symbol table info available.
#18 0x00007f3646d94720 in ?? () from /home/fred/Applications/firefox/libxul.so
No symbol table info available.
#19 0x00007f3646d955bd in ?? () from /home/fred/Applications/firefox/libxul.so
No symbol table info available.
#20 0x00007f3646caff3c in ?? () from /home/fred/Applications/firefox/libxul.so
No symbol table info available.
#21 0x00007f3646cb12da in ?? () from /home/fred/Applications/firefox/libxul.so
No symbol table info available.
Cannot access memory at address 0x7fff90a93308
Comment 29 Frederic Bezies 2014-03-05 20:09:56 UTC
Looks like a NouVeau is triggered by Mozilla Firefox trunk code.

I opened a bug on mozilla bugtracker, https://bugzilla.mozilla.org/show_bug.cgi?id=978439

Even if I find a "dirty workaround", valgrind logs are showing a NouVeau driver bug.

I will attach both Mozilla Firefox valgrind log and glxtest one.

But there is a common part :

Invalid read of size 4
    at 0x8DBA0BE: nouveau_fence_wait (nouveau_fence.c:205)
    by 0x8E4AFF4: nv50_screen_destroy (nv50_screen.c:293)
    by 0x8DAACDE: dri_destroy_screen_helper (dri_screen.c:378)
    by 0x8DAAD74: dri_destroy_screen (dri_screen.c:388)
    by 0x8AB13D6: driDestroyScreen (dri_util.c:234)
    by 0x69A06C1: dri2DestroyScreen (dri2_glx.c:712)
    by 0x697CFE5: FreeScreenConfigs.isra.3 (glxext.c:208)
    by 0x697D064: glx_display_free (glxext.c:231)
    by 0x697D164: __glXCloseDisplay (glxext.c:272)
    by 0x50569F1: XCloseDisplay (in /usr/lib/libX11.so.6.3.0)
Comment 30 Frederic Bezies 2014-03-05 20:10:37 UTC
Created attachment 95188 [details]
Valgrind log for glxtest

Here is the valgrind log for Glx test
Comment 31 Frederic Bezies 2014-03-05 20:12:52 UTC
Created attachment 95189 [details]
Valgrind log with debug symbols for Mozilla Firefox trunk code.

And log for mozilla Firefox trunk code.

Always the same beginning :

Invalid read of size 4
   at 0x161590BE: nouveau_fence_wait (nouveau_fence.c:205)
    by 0x161E9FF4: nv50_screen_destroy (nv50_screen.c:293)
    by 0x16149CDE: dri_destroy_screen_helper (dri_screen.c:378)
    by 0x16149D74: dri_destroy_screen (dri_screen.c:388)
    by 0x15E503D6: driDestroyScreen (dri_util.c:234)
    by 0x1256D6C1: dri2DestroyScreen (dri2_glx.c:712)
    by 0x12549FE5: FreeScreenConfigs.isra.3 (glxext.c:208)
    by 0x1254A064: glx_display_free (glxext.c:231)
    by 0x1254A164: __glXCloseDisplay (glxext.c:272)
    by 0xEF009F1: XCloseDisplay (in /usr/lib/libX11.so.6.3.0)

Any idea ?
Comment 32 Benoit Jacob 2014-03-05 20:16:09 UTC
...note that the log attached by Frederic in comment 30 also has a stack to where this pointer was free'd --- we have a use-after-free bug here!

This was obtained by Frederic by running the glxtest program attached in comment 2 here, in valgrind:

  valgrind --track-origins=yes ./glxtest

I hope that Nouveau developers here can reproduce this easily!
Comment 33 Benoit Jacob 2014-03-05 20:30:26 UTC
The stack to the free() points to line 203 here, while the stack to where the free'd data is subsequently used points to line 205 here:

http://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/nouveau_fence.c?id=ce6dd69697ae62d9336bbd4f5808bc4d75cdcc04#n203


   if (fence == screen->fence.current)
      nouveau_fence_next(screen);

   do {
      nouveau_fence_update(screen, FALSE);  // <--- free here!

      if (fence->state == NOUVEAU_FENCE_STATE_SIGNALLED) // <-- use-after-free
         return TRUE;


So it seems like nouveau_fence_update (which was apparently inlined) destroys the fence object... do you need to call nouveau_fence_ref() to keep it alive?
Comment 34 Benoit Jacob 2014-03-05 20:41:24 UTC
Also, here's the story of how that caused Firefox to keep spinning here. The mozilla change that made this bug noticeable (by having Firefox stuck for a minute there) was https://bugzilla.mozilla.org/show_bug.cgi?id=860254 . Before that, this use-after-free read was being lucky enough to read the data that was still present in memory even after the free(), so fence->state still had the expected value; but after that mozilla-side change, our memory allocator started immediately poisoning free'd memory, so fence->state had a totally wrong value, causing this loop here to spin indefinitely.
Comment 35 Ilia Mirkin 2014-03-05 20:46:25 UTC
(In reply to comment #33)
> The stack to the free() points to line 203 here, while the stack to where
> the free'd data is subsequently used points to line 205 here:
> 
> http://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/drivers/nouveau/
> nouveau_fence.c?id=ce6dd69697ae62d9336bbd4f5808bc4d75cdcc04#n203
> 
> 
>    if (fence == screen->fence.current)
>       nouveau_fence_next(screen);
> 
>    do {
>       nouveau_fence_update(screen, FALSE);  // <--- free here!
> 
>       if (fence->state == NOUVEAU_FENCE_STATE_SIGNALLED) // <--
> use-after-free
>          return TRUE;
> 
> 
> So it seems like nouveau_fence_update (which was apparently inlined)
> destroys the fence object... do you need to call nouveau_fence_ref() to keep
> it alive?

This code is rather confusing. You have to keep in mind how it's used, which among other things is from the kick handler. I tried to fix it up with

http://cgit.freedesktop.org/mesa/mesa/commit/?id=ce6dd69697ae62d9336bbd4f5808bc4d75cdcc04

But I guess it was insufficient? I have an odd recollection that I felt like the stuff in the context destroy was suspect, but I don't remember how. Since it wasn't directly related to my problem, I left it alone (esp since I was going under the assumption that it would only be triggered on exit, in which case it's harder to care). Unfortunately it was a long enough time ago that I've lost my context on this. I'm guessing that the key here is that there are multiple contexts and one screen. (Someone should confirm that to be the case.)
Comment 36 Benoit Jacob 2014-03-05 20:51:18 UTC
Can you reproduce yourself by running the attached glxtest program in valgrind?
Comment 37 Ilia Mirkin 2014-03-05 20:53:27 UTC
(In reply to comment #36)
> Can you reproduce yourself by running the attached glxtest program in
> valgrind?

Oh, I'm sure (can't right now, will try tonight). But I doubt it's the source of the bug. The glxtest program doesn't appear to hang. Memory is unlikely to be reallocated _so_ quickly, that ->state would change. (Although more likely in a multithreaded program... hm.)
Comment 38 Benoit Jacob 2014-03-05 20:57:36 UTC
I wasn't clear enough in comment 34, let me explain better :-)

The Mozilla change that exposed this, https://bugzilla.mozilla.org/show_bug.cgi?id=860254, is exactly about having memory overwritten immediately on free(). So this is _exactly_ what is happening here :-)

The reason why the stand-alone glxtest program doesn't hang is that it doesn't use Mozilla's modified memory allocator that overwrites memory on free(). If you ran it with this memory allocator, no doubt that it would hang there, as this loop would keep reading a wrong fence->status.

So what you can hope to reproduce yourself with this glxtest program is not the hang, it's the valgrind-detected use-after-free, and I really believe that that is the root cause for the hang when running with Mozilla's memory allocator.
Comment 39 Ilia Mirkin 2014-03-05 20:59:31 UTC
(In reply to comment #38)
> I wasn't clear enough in comment 34, let me explain better :-)
> 
> The Mozilla change that exposed this,
> https://bugzilla.mozilla.org/show_bug.cgi?id=860254, is exactly about having
> memory overwritten immediately on free(). So this is _exactly_ what is
> happening here :-)

Ah, I missed that little bit, sorry! You're almost certainly right then.
Comment 40 Ilia Mirkin 2014-03-06 03:14:18 UTC
OK, so I'm obviously able to reproduce the use-after-free. Like I said, this logic is fairly subtle. I think the problem is that there's an implicit assumption that the fence has an external ref before someone waits on it. In the case of the current screen fence, that's not necessarily the case. I think the proper fix is to add a fence ref in the destructor around the wait. I'm going to think a bit more on this.
Comment 41 Ilia Mirkin 2014-03-06 03:30:20 UTC
Created attachment 95205 [details] [review]
patch to fix fence wait on screen destroy

This fixes the valgrind warning for me, and makes logical sense (always a plus).
Comment 42 Emil Velikov 2014-03-18 19:09:28 UTC
Frederic, Benoit,

Thank you very much for the enormous help with triaging this bug.
A few quick tests on my system indicate that the problem is resolved by the linked patch. AFAICS the patch is only in the master branch as it just missed the 10.0.4 and 10.1.0 release. The commit is annotated and will be included in the next mesa bug fix releases.

Feel free to reopen the report if the bug is still around with the patch from commit 41.
Comment 43 Frederic Bezies 2014-04-19 13:04:57 UTC
Bug is fixed for good. Archlinux upgraded to Mesa 10.1.1 today, and after rebuilding firefox, no more freeze problem with nouveau. Thanks for the fix.