Bug 107369 - "volatile" in OpenCL code not recognized when compiling with -fstack-protector
Summary: "volatile" in OpenCL code not recognized when compiling with -fstack-protector
Status: RESOLVED MOVED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Gallium/StateTracker/Clover (show other bugs)
Version: 18.0
Hardware: x86-64 (AMD64) Linux (All)
: medium normal
Assignee: mesa-dev
QA Contact: mesa-dev
URL:
Whiteboard:
Keywords:
Depends on:
Blocks: 99553
  Show dependency treegraph
 
Reported: 2018-07-25 07:43 UTC by Gian-Carlo Pascutto
Modified: 2019-09-18 18:03 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
clover debug dump (10.63 MB, application/x-xz)
2018-07-31 16:25 UTC, infinity0
Details

Note You need to log in before you can comment on or make changes to this bug.
Description Gian-Carlo Pascutto 2018-07-25 07:43:57 UTC
https://github.com/CNugteren/CLBlast/issues/298
https://github.com/gcp/leela-zero/issues/1612

Using "volatile" in OpenCL kernels will generate an "unsupported initializer for address space" error for the combinations of:

Mesa 18.1.3 (POLARIS10, DRM 3.23.0, 4.16.0-1-amd64, LLVM 6.0.1) 
Mesa 17.3.9 AMD KABINI (DRM 2.49.0 / 4.9.0-6-amd64, LLVM 5.0.1)

But works for:

Mesa 18.0.4 (POLARIS11 / DRM 3.23.0 / 4.16.13-1-MANJARO, LLVM 6.0.0)
Mesa 18.1.3 (POLARIS11, DRM 3.23.0, 4.16.18-1-MANJARO, LLVM 6.0.0)

So, this problem curiously seems to avoided by the POLARIS11 OpenCL support and not specific to the Mesa version itself.

volatile has been legal in OpenCL code at least since OpenCL 1.1 and is used in the above projects to improve register allocation (without which AMD hardware gets a performance penalty).
Comment 1 Jan Vesely 2018-07-25 14:10:39 UTC
mesa/clover does not touch the clc code. that is processed by clang/llvm.
can you run with CLOVER_DEBUG=llvm,native CLOVER_DEBUG_FILE=dump-file and post the files?
Comment 2 Aaron Watry 2018-07-25 19:33:48 UTC
My home system has an RX580 (Polaris10).

I just cloned/built CLBlast and it appears to be running ./clblast_tuner_xgemm (mentioned as the specific failing case in https://github.com/CNugteren/CLBlast/issues/298) successfully, so I don't know that the issue is polaris10 specific.

In my case, I've got Mesa 18.2.0-b21b38c46cd61d9 (latest upstream as of the last hour or so), libclc r335280 (latest revision since late June), and llvm 7.0.0-svn as of r337934 (a few minutes ago).

I did notice that the LLVM version for both failing cases is different than the passing ones, so I went and downgraded to llvm 6.0.1... but it still works.

w/ LLVM 6.0.1 (first section of 578 tests):
  Found best result 1.43 ms: 1505.0 GFLOPS

w/ LLVM 7.0.0svn:
  Found best result 1.50 ms: 1428.2 GFLOPS

I'd agree with Jan that a dump of the dumped llvm bitcode would be useful. Also, it may be interesting to try to upgrade libclc or mesa to the latest upstream code to see if one of those has an effect.
Comment 3 Gian-Carlo Pascutto 2018-07-25 19:49:34 UTC
You can see in my original comment that there are passes with LLVM 6.0.0 and failures for LLVM 6.0.1. So indeed, it can't be LLVM either.

I'm not sure if it's possible for libclc to be different between those, but it looks like it must be?

I asked one of the people suffering from this bug if he could provide the requested output. The affected distro appears to be Debian, so maybe we're looking at the version of this: https://packages.debian.org/sid/libclc-amdgcn
Comment 4 Aaron Watry 2018-07-26 14:25:42 UTC
My original theory (before I tested 6.0.1 myself) was that the spectre mitigation changes that went into 6.0.1 broke something about compilation of CL kernels, but that doesn't seem to have been the case (I did go through the entire commit history of 6.0.0->6.0.1 to try to identify things that looked suspicious).

I do run Ubuntu at home, so it would be feasible for me to at least install the libclc version they've got and give it a spin. LLVM/Mesa might be a bit more work, but if we run out of other options, I might give it a try. I'd have to downgrade my whole stack, but it's possible they're compiling with different flags/features which changes behavior somehow.

I think for now, it's worth giving the original reporter a bit of time to try to dump the bitcode and get us a bit more info that might help us reproduce/diagnose this since it seems there's something version/distro specific possibly at play.  It might be useful to know the/an exact build/tag/revision that is failing for CLBlast as well, just to eliminate that.  I'm assuming that the latest git master is broken, but feel free to tell me otherwise.

And to answer the implied question: libclc supports multiple LLVM versions, and we've supported LLVM 6.0.x in libclc for a while now (and still build against 3.9 - 7.0.0svn).
Comment 5 Aaron Watry 2018-07-26 16:34:48 UTC
I reverted llvm to the 6.0.0 version packaged in the padoka ppa (from my 7.0svn build) and libclc to the package manager's version and managed to reproduce the failure.

Failing versions:
  libclc: 0.2.0+git20180312-1
  llvm..: 1:6.0-41~exp5~ubuntu1
  mesa..: 18.2-dev (b21b38c46cd)

Since mesa hadn't changed and LLVM didn't seem to be the issue, I upgraded libclc to the current upstream revision (still built using llvm 6.0.0), and it started working.

Figuring that debian's version included the git checkout date, I checked out the latest code as of both Mar 8 and Mar 12, but both of those worked as well.

So, I checked out the debian sources and re-built their .debs on my system, those failed.

They've patched configure.py in libclc to add in some debian-specific build flags. When I copied those changes to upstream libclc as of 20180312 it started failing in the same way. Those same changes also break the current upstream libclc source.

The specific lines that debian have patched libclc's configure.py with that break things seem to be:

# add in debian build flags
proc_cflags = Popen(("dpkg-buildflags","--get","CFLAGS"), stdout=PIPE)
clang_bc_flags = " " + proc_cflags.communicate()[0].strip("\n")

Where the output of 'dpkg-buildflags --get CFLAGS' for me is:
-g -O2 -fdebug-prefix-map=/home/awatry/src/libclc=. -fstack-protector-strong -Wformat -Werror=format-security

Something in there is breaking the bitcode compilation for libclc in debian.
Comment 6 Aaron Watry 2018-07-26 17:03:03 UTC
Passing "-fstack-protector-strong" into clang_bc_flags is what's breaking libclc for debian.
Comment 7 Gian-Carlo Pascutto 2018-07-26 17:08:58 UTC
Nice debugging.

Is this LLVM miscompiling the library, or LLVM trying to do stack protection on AMD GCN code?
Comment 8 infinity0 2018-07-27 06:30:46 UTC
For background, this is the CFLAGS that Debian passes when compiling all C programs. From the GCC man page:

-fstack-protector
   Emit extra code to check for buffer overflows, such as stack smashing
   attacks.  This is done by adding a guard variable to functions with
   vulnerable objects.  This includes functions that call "alloca", and
   functions with buffers larger than 8 bytes.  The guards are initialized when
   a function is entered and then checked when the function exits.  If a guard
   check fails, an error message is printed and the program exits.

-fstack-protector-strong
   Like -fstack-protector but includes additional functions to be protected ---
   those that have local array definitions, or have references to local frame
   addresses.

Of course GPUs are different so this may not be appropriate. I'm not familiar with how OpenCLs work so please confirm what the most appropriate solution is for Debian, either:

- append -fno-stack-protector to the Debian flags, if the other flags are appropriate/relevant here

- omit all the flags completely and leave clang_bc_flags alone, if this is such a special thing that Debian's system-policy CFLAGS should be completely ignored
Comment 9 infinity0 2018-07-27 06:43:39 UTC
I guess a third possibility is that stack protectors are actually relevant for GPUs but Clang/LLVM is not generating correct code for those in this case.
Comment 10 Gian-Carlo Pascutto 2018-07-27 07:51:36 UTC
infini
Comment 11 Jan Vesely 2018-07-27 14:59:40 UTC
GPUs don't really have a stack (not for data anyway) and AMDGCN backend currently inlines all function calls anyway.
I'm not sure what kind of checks the flag adds.
If anyone can upload the different libclc bitcode it should be easy to spot.
My guess would be that it adds some initialized global variables used in internal checks. This is illegal in CLC. only variables in constant address space can have initializers.

IMO Debian should not be arbitrarily adding compilation flags unless they know what they're doing and they have tested the resulting package.

local variables are stored in private address space which is backed either by register file, or private buffers.
using volatile, to control where it is located, is a rather hacky workaround of suboptimal register allocation/instruction scheduling.
Reporting a llvm bug with a reproducer can help. it'd need to be reproducible using llvm-7, there are no further releases of llvm-5 or 6 planned.
Comment 12 Gian-Carlo Pascutto 2018-07-27 16:18:49 UTC
So the bugs are:

1) Debian adding flags even if they don't make sense (i.e. stack protection on GPU code). Should be filed in Debian against libclc.

2) LLVM respecting that flag even if it shouldn't. Should be checked if it still happens with LLVM 7 and filed against LLVM if so.
Comment 13 infinity0 2018-07-31 16:25:33 UTC
Created attachment 140907 [details]
clover debug dump

Attached a dump, it's 1 out of several thousand files and compressed is 11MB.
Comment 14 infinity0 2018-09-18 02:45:22 UTC
Did you get a chance to analyze my dump?

I filed a LLVM bug here https://bugs.llvm.org/show_bug.cgi?id=38979

Didn't yet get a chance to test it with LLVM 7 yet but I didn't manage to find any related bugs in their tracker so perhaps nobody fixed it yet.
Comment 15 Jan Vesely 2018-10-21 16:57:56 UTC
(In reply to infinity0 from comment #14)
> Did you get a chance to analyze my dump?
> 
> I filed a LLVM bug here https://bugs.llvm.org/show_bug.cgi?id=38979
> 
> Didn't yet get a chance to test it with LLVM 7 yet but I didn't manage to
> find any related bugs in their tracker so perhaps nobody fixed it yet.

Hi,

sorry for the late reply. given the already strained time/resources, this is not a priority. The fix to this bug is for Debian to stop modifying libclc cflags.
Comment 16 Stuart Young 2018-10-24 22:43:53 UTC
There's a fix for this in the Debian experimental package for libclc. If you haven't already done so and this bug affects you, please test and give feedback on the Debian bug.

Related links:
 Tracker: https://tracker.debian.org/pkg/libclc
 Pkg change detail: https://tracker.debian.org/news/994503/accepted-libclc-020git20180917-2-source-into-experimental/
 Debian bug report: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904718
Comment 17 GitLab Migration User 2019-09-18 18:03:22 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/144.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.