Bug 77580 - GCC memory starvation caused by flatten attribute with LTO
Status: RESOLVED MOVED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: unspecified
Hardware: Other All
Importance: low normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
Reported: 2014-04-17 15:19 UTC by Martin Liska
Modified: 2019-11-27 13:33 UTC

Attachments
LTO patch (916 bytes, text/plain)
2014-04-17 15:19 UTC, Martin Liska
Patch to allow LTO optmization of xf86-video-intel (832 bytes, patch)
2016-07-15 18:47 UTC, Patrick McMunn
pre-processed sources (120.38 KB, application/x-xz)
2016-07-23 19:15 UTC, main.haarp

Description Martin Liska 2014-04-17 15:19:57 UTC
Created attachment 97520
LTO patch

Hello,
    I've been testing GCC 4.9 on a virtual Gentoo machine and I noticed that you use the flatten attribute in the source code. For the flatten functions in src/sna/sna_glyphs.c, the inliner inlines about 3.3M functions and then crashes after running out of memory (I have 8 GB of RAM).

Please note that LTO is able to optimize the whole program. As a result, it sees almost all function bodies, and that leads to enormous inlining.

The suggested patch removes these flatten attributes for selected functions.

Thank you,
Martin
Comment 1 Chris Wilson 2014-04-17 19:55:06 UTC
If you can detect LTO and no-op the flatten define, that sounds like a useful patch.
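
A minimal sketch of what "no-op the flatten define" could look like. As far as I am aware, GCC does not predefine a macro when -flto is in use, so this assumes the build system passes a hypothetical SNA_BUILT_WITH_LTO define whenever LTO is enabled. The macro name and the helper functions are illustrative only and are not taken from the driver, whose real definition may differ.

```c
#include <assert.h>

/* SNA_BUILT_WITH_LTO is a hypothetical macro: the build system would have
 * to add -DSNA_BUILT_WITH_LTO=1 to CFLAGS whenever it also adds -flto. */
#if defined(__GNUC__) && !defined(SNA_BUILT_WITH_LTO)
#define flatten __attribute__((flatten))
#else
#define flatten /* no-op: leave inlining to the compiler's heuristics */
#endif

/* Illustrative helpers standing in for the callees of a hot function. */
static int add_one(int x) { return x + 1; }
static int twice(int x) { return 2 * x; }

/* With the attribute active, GCC tries to inline every call made in this
 * body; with the no-op define it is an ordinary function. */
flatten int flattened_entry(int x)
{
    return twice(add_one(x));
}

/* The result must be the same whichever definition of flatten is used. */
void flatten_self_check(void)
{
    assert(flattened_entry(3) == 8);
}
```

Building the same file once with and once without -DSNA_BUILT_WITH_LTO=1 should give identical behaviour; only the amount of inlining differs.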
Comment 2 Patrick McMunn 2015-12-22 04:55:56 UTC
I have wanted for a long time to be able to use LTO on xf86-video-intel, so I was very pleased when I found this patch submission. I tried it on the latest development version of the Intel driver from git. It applied cleanly, and I was able to compile the Intel driver using LTO instead of experiencing the seemingly infinite compile time that LTO otherwise causes.

However, despite compilation being successful, my tests, which involved glxgears, monitoring CPU usage, and watching videos on YouTube, showed significantly poorer video performance with the LTO-compiled driver than without LTO. Though glxgears showed no discernible difference, YouTube playback was so slow that the audio continued at normal speed while the video lagged progressively further behind in slow motion.

I have no way of knowing if LTO is directly responsible for the poor performance or if the patch somehow led to poor optimization by LTO, but this should be investigated further. I used GCC 4.9.3 for the test. My Linux distro currently doesn't offer the 5.x branch of GCC, so I was unable to test with GCC 5.3.
Comment 3 Martin Liska 2016-01-04 13:15:58 UTC
(In reply to Patrick McMunn from comment #2)

Hi Patrick.

Well, it looks like the xf86-video-intel driver needs flattened functions to produce optimal code. It would be interesting if you rebuilt the driver with the suggested patch applied (or is it part of mainline?) and tried to generate a perf report that can provide a comparison between the LTO and non-LTO builds.

Martin
Comment 4 Patrick McMunn 2016-07-15 01:57:04 UTC
I know it's been a while since I followed up. First of all, I don't know how to generate a perf report, so that's partly why I didn't follow up sooner. Perhaps you can point me to a tutorial?

Also, in retrospect, I suppose there's a possibility that I may have compiled incorrectly. The -flto compiler flag, as well as the same compiler optimization flags used during compilation, must also be used during the linking phase. I use Funtoo, a variant of Gentoo, and I had -flto in my CFLAGS/CXXFLAGS but not in my LDFLAGS. Different programs' makefiles handle this differently: some automatically add CXXFLAGS to LDFLAGS during the linking phase; some don't. I'm not sure about xf86-video-intel's linking phase, so it's possible that, in my earlier attempt, it wasn't linked and optimized properly. I should probably revisit that.

In my earlier attempt, I was using GCC 4.9.3. Since that time, I've been using GCC 5.3.0 and, more recently, 6.1.0. GCC 5 and 6 fail the compilation phase before it even gets to the linking phase, though that can be resolved by removing a few force_inline directives in the source code. In fact, I tried removing every instance of force_inline, and it not only allowed compilation to complete, it also slightly reduced the memory used during linking. Alas, linking still failed, apparently due to using too much memory.

Anyway, I'll give your patch another go just in case what I described above is what caused poorer performance with the LTO build. I really would like to get LTO working with this driver. LTO has really improved a lot in GCC with the 6.1.0 release.
Comment 5 Patrick McMunn 2016-07-15 18:47:44 UTC
Created attachment 125089
Patch to allow LTO optmization of xf86-video-intel

I'm attaching a patch that takes a somewhat simpler approach to disabling "flatten" and "force_inline" which cause problems with LTO.

I used glxgears to measure FPS. The system was a Pentium 4 with an integrated 845G graphics chip, so it's pretty underpowered hardware, hence the low FPS. Without LTO, I got ~19.5 FPS running glxgears fullscreen at 1680x1050 resolution. With LTO, I got about 22.5 FPS. That may not seem like much, but that's roughly a 15% performance improvement! Except for the LTO-related flags, I used the same compiler and linker flags during compilation.

Flags used during compilation:

CFLAGS="-O2 -march=pentium4 -pipe -fdevirtualize-speculatively ${LTO}"
CXXFLAGS="${CFLAGS}"
LDFLAGS="-Wl,-O1 -Wl,--as-needed -Wl,--enable-new-dtags -Wl,--sort-common -Wl,--sort-section=name -Wl,-Bsymbolic-functions -Wl,-z,combreloc ${CXXFLAGS}"
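
For reference, the relative gain implied by the two glxgears readings in comment 5 (about 19.5 FPS without LTO, about 22.5 FPS with LTO) can be worked out directly. This is only a sketch of the arithmetic; the function name is made up for illustration, and glxgears figures are rough to begin with.

```c
#include <assert.h>

/* Relative change of with_lto over baseline, expressed in percent.
 * The name is illustrative and not part of any project. */
double relative_gain_percent(double baseline, double with_lto)
{
    assert(baseline > 0.0); /* a zero baseline has no meaningful ratio */
    return (with_lto - baseline) / baseline * 100.0;
}
```

Calling relative_gain_percent(19.5, 22.5) gives (22.5 - 19.5) / 19.5 * 100, which is roughly 15.4.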
Comment 6 main.haarp 2016-07-16 11:49:33 UTC
(In reply to Patrick McMunn from comment #5)
> Created attachment 125089
> Patch to allow LTO optmization of xf86-video-intel
> 

Nice work! This allows building with LTO on gcc-4.9. gcc-5.3.0 however fails to build it, throwing a bunch of these around:

/usr/include/bits/string3.h:50:1: error: inlining failed in call to always_inline ‘memcpy’: target specific option mismatch


Also, unlike your P4, I cannot detect any measurable performance improvements (Intel Sandy Bridge). Unfortunate.

All tests done with xf86-video-intel from today's git.
Comment 7 Martin Liška 2016-07-18 11:25:34 UTC
I can confirm that the attached patch helps to build the project with LTO enabled. I tried both 5.3.1 and latest trunk (7.0.0) and both work fine.

Can you please attach pre-processed source code for the issue:
/usr/include/bits/string3.h:50:1: error: inlining failed in call to always_inline ‘memcpy’: target specific option mismatch

and also command line arguments for GCC.

Thanks
Comment 8 Patrick McMunn 2016-07-20 03:41:49 UTC
(In reply to main.haarp from comment #6)

Hmm... You shouldn't be getting that error. That's the error I was getting before I removed the "always_inline" definitions. If the patch worked properly for you, the compiler shouldn't be running into an "always_inline" directive.

I don't think I actually tested it with 5.3.0. I only verified that it builds with 4.9.3, and all my tests with performance involved 6.1.0.

I did numerous tests involving various compiler flags such as -O2, -O3, graphite compiler flags, etc., and I found that I actually got reduced performance with -O3 compared to -O2, and reduced performance when using additional optimizations like graphite. The biggest improvement I got was with simple -O2 and a proper -march setting.
Comment 9 Patrick McMunn 2016-07-20 03:51:30 UTC
Also, make sure that you're passing the same flags to the linker that you're passing to the compiler. If your compiler flags are something like

CXXFLAGS="-O2 -march=native -flto -ffat-lto-objects -fuse-linker-plugin"

then your linker flags should look something like

LDFLAGS="-Wl,-O1 -Wl,--as-needed -O2 -march=native -flto -ffat-lto-objects -fuse-linker-plugin"

Otherwise it won't be properly optimized during the linking phase.
Comment 10 main.haarp 2016-07-23 19:15:17 UTC
Created attachment 125281
pre-processed sources

(In reply to Martin Liška from comment #7)
> I can confirm that the attached patch helps to build the project with LTO
> enabled. I tried both 5.3.1 and latest trunk (7.0.0) and both work fine.
> 
> Can you please attach pre-processed source code for the issue:
> /usr/include/bits/string3.h:50:1: error: inlining failed in call to
> always_inline ‘memcpy’: target specific option mismatch
> 
> and also command line arguments for GCC.
> 
> Thanks

Here are the command line arguments and the full error log.

-------------------------

gcc-5.4.0 -DHAVE_CONFIG_H -I. -I/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna -I../.. -Wall -Wpointer-arith -Wmissing-declarations -Wformat=2 -Wstrict-prototypes -Wmissing-prototypes -Wnested-externs -Wbad-function-cast -Wold-style-definition -Wdeclaration-after-statement -Wunused -Wuninitialized -Wshadow -Wmissing-noreturn -Wmissing-format-attribute -Wredundant-decls -Wlogical-op -Werror=implicit -Werror=nonnull -Werror=init-self -Werror=main -Werror=missing-braces -Werror=sequence-point -Werror=return-type -Werror=trigraphs -Werror=array-bounds -Werror=write-strings -Werror=address -Werror=int-to-pointer-cast -Werror=pointer-to-int-cast -fno-strict-aliasing -I/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src -I/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/render_program -fvisibility=hidden -I/usr/include/xorg -I/usr/include/X11/dri -I/usr/include/libdrm -I/usr/include/pixman-1 -I/usr/include/libdrm -Wno-cast-qual -Wno-redundant-decls -Wno-maybe-uninitialized -march=native -O2 -pipe -fno-stack-protector -flto=4 -c /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/blt.c

In file included from /usr/include/features.h:365:0,
                 from /usr/include/stdint.h:25,
                 from /usr/lib/gcc/x86_64-pc-linux-gnu/5.4.0/include/stdint.h:9,
                 from /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/sna.h:40,
                 from /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/blt.c:32:
/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/blt.c: In function ‘to_memcpy’:
/usr/include/bits/string3.h:50:1: error: inlining failed in call to always_inline ‘memcpy’: target specific option mismatch
 __NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
 ^
/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/blt.c:490:4: error: called from here
    memcpy(dst, src, len);
    ^
In file included from /usr/include/features.h:365:0,
                 from /usr/include/stdint.h:25,
                 from /usr/lib/gcc/x86_64-pc-linux-gnu/5.4.0/include/stdint.h:9,
                 from /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/sna.h:40,
                 from /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/blt.c:32:
/usr/include/bits/string3.h:50:1: error: inlining failed in call to always_inline ‘memcpy’: target specific option mismatch
 __NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
 ^
/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/blt.c:555:2: error: called from here
  memcpy(dst, src, len & 3);
  ^
In file included from /usr/include/features.h:365:0,
                 from /usr/include/stdint.h:25,
                 from /usr/lib/gcc/x86_64-pc-linux-gnu/5.4.0/include/stdint.h:9,
                 from /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/sna.h:40,
                 from /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/blt.c:32:
/usr/include/bits/string3.h:50:1: error: inlining failed in call to always_inline ‘memcpy’: target specific option mismatch
 __NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
 ^
/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/blt.c:490:4: error: called from here
    memcpy(dst, src, len);
    ^
In file included from /usr/include/features.h:365:0,
                 from /usr/include/stdint.h:25,
                 from /usr/lib/gcc/x86_64-pc-linux-gnu/5.4.0/include/stdint.h:9,
                 from /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/sna.h:40,
                 from /var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/blt.c:32:
/usr/include/bits/string3.h:50:1: error: inlining failed in call to always_inline ‘memcpy’: target specific option mismatch
 __NTH (memcpy (void *__restrict __dest, const void *__restrict __src,
 ^
/var/tmp/portage/x11-drivers/xf86-video-intel-9999/work/xf86-video-intel-9999/src/sna/blt.c:555:2: error: called from here
  memcpy(dst, src, len & 3);
  ^
Makefile:655: recipe for target 'blt.lo' failed

-----------------------------------

Attached is what gcc generates with -E. I think that's what you needed, correct?

Thanks!
Comment 11 Martin Liška 2016-07-25 11:39:56 UTC
(In reply to main.haarp from comment #10)

I can confirm that. Please take a look at the PR I created:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71991
Comment 12 main.haarp 2016-08-01 14:03:23 UTC
(In reply to Martin Liška from comment #11)
> 
> I can confirm that, please take a look at the created PR:
> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71991

Interesting, thanks for filing this!
Comment 13 Martin Peres 2019-11-27 13:33:27 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-intel/issues/28.

