Bug 96897 - clpeak OpenCL benchmark hangs during compilation on Clover RadeonSI
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/Gallium/radeonsi
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
Blocks: 99553
 
Reported: 2016-07-12 11:10 UTC by Jan Ziak (http://atom-symbol.net)
Modified: 2018-05-18 02:45 UTC

Attachments
gdb backtrace (72.40 KB, text/plain)
2016-07-12 11:13 UTC, Jan Ziak (http://atom-symbol.net)
clinfo for the system (12.50 KB, text/plain)
2017-04-27 20:37 UTC, M. Edward (Ed) Borasky

Description Jan Ziak (http://atom-symbol.net) 2016-07-12 11:10:26 UTC
Hello.

clpeak (http://github.com/krrishnarraj/clpeak) effectively enters an infinite loop during compilation.

GPU: R9 390
Kernel module: amdgpu.ko, linux 4.7.0-rc7
Mesa: 12.1.0-devel (git-ead7736)
LLVM: git 2016-jul-11

$ clinfo 
Number of platforms:    1 (should be 2: intel.cpu + mesa.gpu)
  Platform Version:     OpenCL 1.1 Mesa 12.1.0-devel (git-ead7736)

$ ll /usr/lib64/libOpenCL.so.1 
/usr/lib64/libOpenCL.so.1 -> OpenCL/vendors/mesa/libOpenCL.so.1.0.0

(Gentoo Linux)
$ eselect opencl list
Available OpenCL implementations:
  [1]   amdgpu-pro
  [2]   intel
  [3]   mesa *
  [4]   nvidia
Comment 1 Jan Ziak (http://atom-symbol.net) 2016-07-12 11:13:44 UTC
Created attachment 125023
gdb backtrace
Comment 2 Michel Dänzer 2016-07-13 05:29:04 UTC
Looks like deep recursion in clover / LLVM code.
Comment 3 Vedran Miletić 2017-03-22 16:13:17 UTC
Interesting, I will look into this.
Comment 4 Vedran Miletić 2017-03-22 18:24:10 UTC
This no longer hangs with either LLVM 3.9.1 or today's LLVM git. Instead, compilation fails with:

input.cl:34:106: error: call to 'mad' is ambiguous
input.cl:30:22: note: expanded from macro 'MAD_64'
input.cl:29:22: note: expanded from macro 'MAD_16'
input.cl:28:25: note: expanded from macro 'MAD_4'
/usr/local/include/clc/math/mad.inc:1:39: note: candidate function
/usr/local/include/clc/math/mad.inc:1:39: note: candidate function
/usr/local/include/clc/math/mad.inc:1:39: note: candidate function
/usr/local/include/clc/math/mad.inc:1:39: note: candidate function
/usr/local/include/clc/math/mad.inc:1:39: note: candidate function
/usr/local/include/clc/math/mad.inc:1:39: note: candidate function
/usr/local/include/clc/math/mad.inc:1:39: note: candidate function
/usr/local/include/clc/math/mad.inc:1:39: note: candidate function
/usr/local/include/clc/math/mad.inc:1:39: note: candidate function
/usr/local/include/clc/math/mad.inc:1:39: note: candidate function
/usr/local/include/clc/math/mad.inc:1:39: note: candidate function
/usr/local/include/clc/math/mad.inc:1:39: note: candidate function
input.cl:34:106: error: call to 'mad' is ambiguous

Did clpeak change or did we change? If we changed, did we regress?
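
The notes in the diagnostics above show the failing call sitting inside a chain of macros (MAD_64, MAD_16, MAD_4). As a rough illustration of why one unresolved overload can be hit many times, here is a minimal sketch that counts the calls such a chain would produce. It assumes MAD_4 holds four mad() calls and that each higher level repeats the previous one four times, as the names suggest; the actual clpeak macro definitions are not shown in this report and may differ.

```python
# Hypothetical model of the macro chain named in the diagnostics.
# Assumption: MAD_4 contains four mad() calls and each higher level
# repeats the level below it four times. The real macros may differ.
MACROS = {
    "MAD_4": ["mad"] * 4,
    "MAD_16": ["MAD_4"] * 4,
    "MAD_64": ["MAD_16"] * 4,
}


def count_calls(name: str) -> int:
    """Count the mad() calls produced by fully expanding a macro."""
    if name == "mad":
        return 1
    return sum(count_calls(part) for part in MACROS[name])


for macro in MACROS:
    print(macro, count_calls(macro))
```

Under these assumptions a single MAD_64 use expands to 64 mad() calls, so every one of them depends on the same overload being resolvable.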
Comment 5 Jan Ziak (http://atom-symbol.net) 2017-03-22 22:08:04 UTC
With LLVM 4.0.0 I am getting the following results:

$ clinfo
  Platform ID:			0x7ff6aaf2ed60
  Name:				AMD HAWAII (DRM 3.10.0 / 4.11.0-rc2+, LLVM 4.0.0)
  Vendor:			AMD
  Device OpenCL C version:	OpenCL C 1.1 
  Driver version:		17.1.0-devel
  Profile:			FULL_PROFILE
  Version:			OpenCL 1.1 Mesa 17.1.0-devel (git-ad13bd2)

$ ./clpeak
Platform: Clover
  Device: AMD HAWAII (DRM 3.10.0 / 4.11.0-rc2+, LLVM 4.0.0)
    Driver version  : 17.1.0-devel (Linux x64)
    Compute units   : 40
    Clock frequency : 1000 MHz
clpeak: /var/tmp/portage/sys-devel/clang-4.0.0/work/x/y/cfe-4.0.0.src/lib/Sema/Sema.cpp:317: clang::Sema::~Sema(): Assertion `DelayedTypos.empty() && "Uncorrected typos!"' failed.
Aborted (core dumped)
Comment 6 Andy Furniss 2017-03-22 22:59:53 UTC
Same for me on tonga + git llvm/libclc/mesa/clpeak

Platform: Clover
  Device: AMD TONGA (DRM 3.13.0 / 4.11.0-rc1-g00c1259, LLVM 5.0.0)
    Driver version  : 17.1.0-devel (Linux x64)
    Compute units   : 28
    Clock frequency : 973 MHz
clpeak: /mnt/sdb1/Gits/llvm/tools/clang/lib/Sema/Sema.cpp:316: clang::Sema::~Sema(): Assertion `DelayedTypos.empty() && "Uncorrected typos!"' failed.
Aborted
Comment 7 Andy Furniss 2017-03-22 23:47:34 UTC
(In reply to Andy Furniss from comment #6)
> Same for me on tonga + git llvm/libclc/mesa/clpeak
> 
> Platform: Clover
>   Device: AMD TONGA (DRM 3.13.0 / 4.11.0-rc1-g00c1259, LLVM 5.0.0)
>     Driver version  : 17.1.0-devel (Linux x64)
>     Compute units   : 28
>     Clock frequency : 973 MHz
> clpeak: /mnt/sdb1/Gits/llvm/tools/clang/lib/Sema/Sema.cpp:316:
> clang::Sema::~Sema(): Assertion `DelayedTypos.empty() && "Uncorrected
> typos!"' failed.
> Aborted

This starts with the following clpeak commit:

16e1b207a4d4e81a0c48c77c950437dca1364cb6 is the first bad commit
commit 16e1b207a4d4e81a0c48c77c950437dca1364cb6
Author: espes <espes@pequalsnp.com>
Date:   Mon Jul 18 17:06:15 2016 -0700

    Add support for halfs

Before this commit it completes OK, but there is a delay of about 40 seconds before results start appearing.
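
The "first bad commit" line above is the kind of result a bisection produces. For readers who have not used one, the idea can be sketched as a binary search over an ordered history. This is only an illustration: the commit names and the is_bad predicate below are placeholders, not real clpeak commits or a real test.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest commit for which is_bad() is true.

    Assumes commits are ordered oldest to newest, the first one is good,
    the last one is bad, and every commit after the first bad one is bad.
    """
    lo, hi = 0, len(commits) - 1  # commits[lo] is good, commits[hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid  # the first bad commit is at mid or earlier
        else:
            lo = mid  # the first bad commit is after mid
    return commits[hi]


# Hypothetical history: placeholder names, not real clpeak commits.
history = ["c0", "c1", "c2", "c3", "c4", "c5"]
print(first_bad(history, lambda c: c >= "c3"))
```

Each step halves the range that can still contain the first bad commit, so even a long history needs only a handful of build-and-run cycles.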
Comment 8 Ricardo Ribalda 2017-04-27 07:59:57 UTC
With:

  Device: AMD CARRIZO (DRM 3.9.0 / 4.10.0-qtec-standard, LLVM 4.0.1)
    Driver version  : 17.0.3 (Linux x64)
    Compute units   : 8
    Clock frequency : 800 MHz


I am getting the same error as Vedran: error: call to 'mad' is ambiguous

After reverting:

16e1b207a4d4e81a0c48c77c950437dca1364cb6 is the first bad commit
commit 16e1b207a4d4e81a0c48c77c950437dca1364cb6
Author: espes <espes@pequalsnp.com>
Date:   Mon Jul 18 17:06:15 2016 -0700

I am experiencing an endless loop as reported by Jan.

I get the same endless loop with:

Platform: Clover
  Device: AMD PALM (DRM 2.49.0 / 4.10.0-qtec-standard, LLVM 4.0.1)
    Driver version  : 17.0.3 (Linux x64)
    Compute units   : 2
    Clock frequency : 0 MHz
Comment 9 M. Edward (Ed) Borasky 2017-04-27 20:34:36 UTC
I have something like this on Fedora - both 25 (stable) and 26 (alpha). I type "clpeak" and the CPU goes to 100% and nothing else happens.

I'll attach a 'clinfo' printout.
Comment 10 M. Edward (Ed) Borasky 2017-04-27 20:37:24 UTC
Created attachment 131104
clinfo for the system

Note: this bug is in Fedora's bugzilla as well - https://bugzilla.redhat.com/show_bug.cgi?id=1433632
Comment 11 M. Edward (Ed) Borasky 2017-05-08 05:56:06 UTC
Linking to a clpeak GitHub issue: https://github.com/krrishnarraj/clpeak/issues/32

Note: I'm now on Arch Linux and I have the non-looping version of this.
Comment 12 Jan Vesely 2017-10-28 22:26:59 UTC
> input.cl:34:106: error: call to 'mad' is ambiguous
This looks to be caused by the lack of half-precision builtins in libclc. GCN and newer GPUs advertise support for cl_khr_fp16 in CLC, but libclc is not ready yet.

You can try my experimental cl_khr_fp16 branch:
https://github.com/jvesely/libclc/tree/cl_khr_fp16
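
To check whether a given device is in the situation described above, one can look for cl_khr_fp16 in the device's extension string, which OpenCL reports as a space-separated list. A minimal sketch follows; the two extension strings are made-up examples and are not taken from the clinfo attachment on this bug.

```python
def advertises_fp16(extensions: str) -> bool:
    """True if a space-separated OpenCL extension string lists cl_khr_fp16."""
    return "cl_khr_fp16" in extensions.split()


# Hypothetical extension strings, not copied from any attachment here.
with_half = "cl_khr_byte_addressable_store cl_khr_fp64 cl_khr_fp16"
without_half = "cl_khr_byte_addressable_store cl_khr_fp64"

print(advertises_fp16(with_half))
print(advertises_fp16(without_half))
```

Splitting on whitespace before comparing avoids matching a longer extension name that merely contains the same prefix.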
Comment 13 Jan Vesely 2018-05-17 23:38:52 UTC
Initial support for cl_khr_fp16 builtins has been added to libclc in r332677.
It should be enough to run clpeak.
clpeak still takes a few minutes to compile the kernels (~7 min on my Carrizo laptop).
Comment 14 Dieter Nützel 2018-05-18 02:45:20 UTC
(In reply to Jan Vesely from comment #13)
> Initial support for cl_khr_fp16 builtins has been added to libclc in r332677.
> It should be enough to run clpeak.
> clpeak still takes few mins to compile the kernels (~7mins on my carrizo
> laptop)

GREAT work Jan!

After about 3 min 12 s the float tests started crunching on my Xeon X3470
(only one core is used for kernel compilation, so it ran at 3.6 GHz turbo).

My desktop was frozen during the float 'Global memory bandwidth (GBPS)' test
and partly frozen during 'Double-precision compute (GFLOPS)'.

The whole benchmark finished after 6 min 17 s.

/home/dieter> time clpeak

Platform: Clover
  Device: Radeon RX 580 Series (POLARIS10, DRM 3.23.0, 4.16.9-1.g4f45b1e-default, LLVM 7.0.0)
    Driver version  : 18.2.0-devel (Linux x64)
    Compute units   : 36
    Clock frequency : 1411 MHz

    Global memory bandwidth (GBPS)
      float   : 2.64
      float2  : 2.64
      float4  : 2.64
      float8  : 2.54
      float16 : 1.45

    Single-precision compute (GFLOPS)
      float   : 6341.87
      float2  : 6131.34
      float4  : 6105.61
      float8  : 5933.91
      float16 : 5939.44

    half-precision compute (GFLOPS)
      half   : 6307.47
      half2  : 6193.25
      half4  : 6114.34
      half8  : 5729.57
      half16 : 6047.90

    Double-precision compute (GFLOPS)
      double   : 404.52
      double2  : 404.41
      double4  : 404.06
      double8  : 403.08
      double16 : 401.53

    Integer compute (GIOPS)
      int   : 1222.75
      int2  : 1213.90
      int4  : 1210.72
      int8  : 1208.57
      int16 : 1213.99

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 8.78
      enqueueReadBuffer          : 4.86
      enqueueMapBuffer(for read) : 4871.79
        memcpy from mapped ptr   : 4.94
      enqueueUnmap(after write)  : 3528.56
        memcpy to mapped ptr     : 4.94

    Kernel launch latency : 293.57 us

206.285u 3.765s 6:17.14 55.6%   0+0k 0+0io 0pf+0w


For reference, the same run with AMD 17.40:
/home/dieter> time clpeak

Platform: AMD Accelerated Parallel Processing
  Device: Ellesmere
    Driver version  : 2482.3 (Linux x64)
    Compute units   : 36
    Clock frequency : 1411 MHz

    Global memory bandwidth (GBPS)
      float   : 202.59
      float2  : 209.30
      float4  : 209.63
      float8  : 162.15
      float16 : 138.41

    Single-precision compute (GFLOPS)
      float   : 6342.71
      float2  : 6374.96
      float4  : 6178.29
      float8  : 5973.53
      float16 : 6018.79

    half-precision compute (GFLOPS)
      half   : 6306.97
      half2  : 6366.06
      half4  : 6350.41
      half8  : 6154.31
      half16 : 6280.47

    Double-precision compute (GFLOPS)
      double   : 404.64
      double2  : 404.38
      double4  : 398.54
      double8  : 403.25
      double16 : 401.53

    Integer compute (GIOPS)
      int   : 1206.77
      int2  : 1221.26
      int4  : 1225.83
      int8  : 1225.88
      int16 : 1227.35

    Transfer bandwidth (GBPS)
      enqueueWriteBuffer         : 9.03
      enqueueReadBuffer          : 5.08
      enqueueMapBuffer(for read) : 149130.81
        memcpy from mapped ptr   : 5.09
      enqueueUnmap(after write)  : 75882.81
        memcpy to mapped ptr     : 5.08

    Kernel launch latency : 93.33 us

23.056u 1.592s 1:08.29 36.0%    0+0k 0+0io 0pf+0w
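
The two runs above can be put side by side with a little arithmetic. The sketch below uses only figures copied from the output above and assumes the third field of each `time` line is the elapsed wall time in minutes:seconds form.

```python
def elapsed_seconds(stamp: str) -> float:
    """Convert an 'M:SS.ss' elapsed-time field to seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


# Figures copied from the two runs above (Clover first, then AMD 17.40).
clover = {"elapsed": "6:17.14", "float_gbps": 2.64, "latency_us": 293.57}
amd = {"elapsed": "1:08.29", "float_gbps": 202.59, "latency_us": 93.33}

time_ratio = elapsed_seconds(clover["elapsed"]) / elapsed_seconds(amd["elapsed"])
bandwidth_ratio = amd["float_gbps"] / clover["float_gbps"]
latency_ratio = clover["latency_us"] / amd["latency_us"]

print(f"Clover wall time is {time_ratio:.1f}x the AMD run")
print(f"AMD float bandwidth is {bandwidth_ratio:.1f}x the Clover figure")
print(f"Clover launch latency is {latency_ratio:.1f}x the AMD figure")
```

The compute figures (GFLOPS and GIOPS) are close between the two runs, so the differences worth following up are the wall time, the global memory bandwidth and the kernel launch latency.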

