Bug 91646 - dlopen'ing libudev.so.1 from static library initializer corrupts TLS state
Summary: dlopen'ing libudev.so.1 from static library initializer corrupts TLS state
Status: RESOLVED MOVED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Mesa core
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: high major
Assignee: mesa-dev
QA Contact: mesa-dev
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-08-15 17:05 UTC by Timo R.
Modified: 2019-09-18 20:24 UTC

See Also:
i915 platform:
i915 features:


Attachments
hack fix (1.05 KB, patch)
2015-08-15 17:31 UTC, Tobias Jakobi

Description Timo R. 2015-08-15 17:05:16 UTC
This is directly related to the following glibc bug:

https://sourceware.org/bugzilla/show_bug.cgi?id=15199

Something in Mesa calls dlopen() on libudev.so.1 from within an early library initializer, which corrupts the TLS state in glibc if the application or some later-loaded library links against libudev.so.1.

As a result, the next time anything accesses a thread-local variable, it runs into an infinite loop in tls_get_addr_tail.

As a workaround I built Mesa with gallium disabled, which made it work for my case.
The application triggering this behaviour was Kodi, but everything that directly or indirectly links against Mesa and libudev is potentially affected, depending on the library load order.

I'm not 100% sure whether this is actually a bug in glibc or whether calling dlopen from within a static library initializer is simply not well defined, but it's definitely something that needs addressing.

I encountered this with the latest Mesa git; I never had the problem with any release version.
Comment 1 Tobias Jakobi 2015-08-15 17:17:01 UTC
The problem seems to originate from here:
http://cgit.freedesktop.org/mesa/mesa/tree/src/gallium/state_trackers/clover/api/platform.cpp#n29

The platform object is created during static initialization, and somewhere in the class hierarchy a dlopen() is triggered. Since _clover_platform only seems to be used in clGetPlatformIDs(), my idea would be to make this object static and move it into the function itself, so that it is only initialized when the function is first called.
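
The idea can be sketched as follows. This is only a minimal illustration, not the attached patch: `Platform` and `get_platform()` are hypothetical stand-ins for clover's platform class and for an accessor that clGetPlatformIDs() could use, and the counter exists purely to make the time of construction observable.

```cpp
// Hypothetical stand-in for clover's platform class. The real constructor
// probes devices and may end up calling dlopen().
struct Platform {
    static int constructed;          // counts constructor runs
    Platform() { ++constructed; }    // real code would probe devices here
};
int Platform::constructed = 0;

// Function-local static: the object is constructed the first time control
// passes through the declaration, not while the library is being loaded.
Platform &get_platform() {
    static Platform instance;
    return instance;
}
```

In this sketch Platform::constructed is still 0 until get_platform() is called for the first time, and every later call returns the same object. Since C++11 the initialization of a function-local static is also guarded against concurrent first calls.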
Comment 2 Tobias Jakobi 2015-08-15 17:31:22 UTC
Created attachment 117708
hack fix

Untested hack/fix that is also not thread-safe.
Comment 3 Francisco Jerez 2015-08-16 10:39:32 UTC
(In reply to Tobias Jakobi from comment #2)
> Created attachment 117708
> hack fix
> 
> Untested hack/fix that is also not thread-safe.

That's unlikely to work: static local variables are no different to globals regarding initialization order. And yes, it looks like a hack, because pipe_loader_probe() shouldn't be doing anything that could corrupt the TLS state when called at initialization time.

It looks like this might be a regression from the series de5c2b6f2b53924bceab6f4b8255d8e9dcad21b4..cc32d25454c382a971e81ae584a4296fdf492e70 (which is indeed not part of any released version yet). You may want to bisect to find which change introduced the problem.
Comment 4 Timo R. 2015-08-16 10:56:00 UTC
> That's unlikely to work, static local variables are no different to globals
> regarding initialization order,

To my knowledge, static local variables are initialized on the first call to the function, whereas global variables are initialized in the library's early static initializer, which runs during library load.

> and, yeah, it seems like a hack because
> pipe_loader_probe() shouldn't be doing anything that could corrupt the TLS
> state when called at initialization time.

The simple act of calling dlopen on libudev.so.1 from within the early static initializer is enough to corrupt the TLS state, but only if some later library also links against libudev.so.1.
So initializing the structure on the first function call rather than on library load might actually help.

> It looks like this might be a regression from the series
> de5c2b6f2b53924bceab6f4b8255d8e9dcad21b4..
> cc32d25454c382a971e81ae584a4296fdf492e70 (which are indeed not part of any
> released version yet), you may want to bisect which change introduced the
> problem.

Not sure when I'll get to this, but I'll see what I can do.
Comment 5 Francisco Jerez 2015-08-16 11:47:02 UTC
(In reply to Timo R. from comment #4)
> > That's unlikely to work, static local variables are no different to globals
> > regarding initialization order,
> 
> > To my knowledge, static local variables are initialized on the first call to
> > the function, whereas global variables are initialized in the library's
> > early static initializer, which runs during library load.
> 

IIRC static local variables are allowed to be initialized statically under roughly the same conditions as global variables. That said, because the platform constructor has side effects, it looks like you're right and the platform will necessarily have to be initialized dynamically the first time the function is run.
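
A small sketch of that distinction, using hypothetical types that are not part of Mesa: a local static with a constant initializer needs no code to run at all, while one whose constructor has side effects has to be constructed when control first reaches its declaration.

```cpp
int probe_count = 0;   // bumped by the side-effecting constructor below

// Hypothetical type with a constexpr constructor.
struct Limits {
    int max_devices;
    constexpr explicit Limits(int n) : max_devices(n) {}
};

// Hypothetical type whose constructor has an observable side effect.
struct Prober {
    Prober() { ++probe_count; }
};

const Limits &limits() {
    // Constant initializer plus constexpr constructor: the object is
    // constant-initialized, so no constructor call happens at run time.
    static const Limits l{8};
    return l;
}

Prober &prober() {
    // The constructor cannot be folded into a constant, so it runs
    // dynamically the first time control reaches this declaration.
    static Prober p;
    return p;
}
```

Calling limits() leaves probe_count untouched, whereas the first call to prober() increments it exactly once.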

> > and, yeah, it seems like a hack because
> > pipe_loader_probe() shouldn't be doing anything that could corrupt the TLS
> > state when called at initialization time.
> 
> The simple act of calling dlopen on libudev.so.1 from within the early
> static initializer is enough to corrupt the TLS state, but only if some
> later library also links against libudev.so.1.
> So not initializing the structure on library-load, but on first function
> call might actually help.
> 
The thing is, you have no guarantee that the function it's now being initialized from will not itself be called from the initializer of a variable with static storage duration. So, assuming that the conditions you describe are enough to corrupt the TLS state, this will only be hiding the problem.
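
To illustrate the concern, here is a minimal sketch with hypothetical names (`Platform`, `get_platform()` and `early_user` are not taken from clover). The object is a function-local static, yet a namespace-scope variable elsewhere calls the accessor from its own initializer, so construction still happens during static initialization.

```cpp
int constructed = 0;   // number of Platform objects built so far

// Hypothetical stand-in for the platform class.
struct Platform {
    Platform() { ++constructed; }   // real code would probe devices here
};

// Lazy accessor: constructs the object on the first call.
Platform &get_platform() {
    static Platform instance;
    return instance;
}

// Hypothetical client code: a namespace-scope variable whose initializer
// calls the accessor. This is dynamic initialization, so the "first call"
// happens while static initializers run, before main() is entered.
Platform *early_user = &get_platform();
```

In this sketch constructed is already 1 when main() starts, even though nothing in main() has called get_platform() yet.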

> > It looks like this might be a regression from the series
> > de5c2b6f2b53924bceab6f4b8255d8e9dcad21b4..
> > cc32d25454c382a971e81ae584a4296fdf492e70 (which are indeed not part of any
> > released version yet), you may want to bisect which change introduced the
> > problem.
> 
> Not sure when I'll get to this, but I'll see what I can do.
Comment 6 Serge Martin 2015-08-16 19:08:18 UTC
Hello

This reminds me of something similar that happened to ocl-icd; see https://bugzilla.redhat.com/show_bug.cgi?id=1219646
Comment 7 Eero Tamminen 2015-08-17 08:11:38 UTC
The latrace tool could tell us something useful:
http://people.redhat.com/jolsa/latrace/
Comment 8 Emil Velikov 2015-09-06 19:06:58 UTC
Did anyone find the time to bisect? I don't mind reverting any of my commits, but I'd like to know which one, as I cannot really test this here.
Comment 9 GitLab Migration User 2019-09-18 20:24:06 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/990.

