| Summary: | Performance compared to Intel® proprietary implementation | | |
|---|---|---|---|
| Product: | Beignet | Reporter: | ilia <inferrna> |
| Component: | Beignet | Assignee: | rongyang <rong.r.yang> |
| Status: | RESOLVED MOVED | QA Contact: | |
| Severity: | minor | | |
| Priority: | medium | | |
| Version: | unspecified | | |
| Hardware: | x86-64 (AMD64) | | |
| OS: | All | | |
| Whiteboard: | | | |
| i915 platform: | | i915 features: | |
| Attachments: | clinfo, Source code | | |
Created attachment 118912 [details]
Source code
Compile and run it once for each of the 3 available platforms:

```sh
gcc dcttest.c -O3 -lOpenCL -lm -o dcttest
for i in 0 1 2; do ./dcttest $i; done
```
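(The attached dcttest.c is not inlined in this report. Purely as an illustration of the invocation above, a harness of this shape would select the OpenCL platform by the command-line index; every name and detail below is an assumption, not the attached code.)

```c
/* Hypothetical sketch of a dcttest-style harness; all names and structure
 * are assumptions, since the real attachment is not inlined above. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    cl_uint nplat = 0;
    clGetPlatformIDs(0, NULL, &nplat);              /* count platforms */
    cl_platform_id *plats = malloc(nplat * sizeof *plats);
    clGetPlatformIDs(nplat, plats, NULL);

    cl_uint idx = (argc > 1) ? (cl_uint)atoi(argv[1]) : 0;
    if (idx >= nplat) {
        fprintf(stderr, "no platform %u (found %u)\n", idx, nplat);
        return 1;
    }

    char name[256];
    clGetPlatformInfo(plats[idx], CL_PLATFORM_NAME, sizeof name, name, NULL);
    printf("Running on platform %u: %s\n", idx, name);

    /* ... create a context and queue on plats[idx], build and enqueue the
     * DCT kernel, and time the run, e.g. with clock_gettime() ... */
    free(plats);
    return 0;
}
```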
I took a quick look at your kernel; there are two improvement points:

1. Use native_cos instead of cos. It loses some precision, so if your program is precision-sensitive you can't use this method.
2. Add some #pragma unroll hints. data[128] and res[128] are private arrays, and Beignet stores them in global memory. Because the accesses to the data array are all at constant indices once the loop is unrolled, the unroll hint lets the compiler promote the array to registers, which can improve performance significantly.

For more optimization tips, please refer to http://www.freedesktop.org/wiki/Software/Beignet/optimization-guide/.

```c
#ifndef INFINITY
#define INFINITY 1.0/0
#endif
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Naive DCT-II over a 128-element array. */
void dct_ii(float *x, float *X)
{
    float sum = 0.;
    #pragma unroll
    for (int n = 0; n < 128; ++n) {
        sum += x[n];
    }
    X[0] = sum;
    for (uint k = 1; k < 128; ++k) {
        sum = 0.;
        #pragma unroll
        for (int n = 0; n < 128; ++n) {
            sum += x[n] * native_cos((float)(M_PI * (n + .5) * k / 128));
        }
        X[k] = sum;
    }
}

__kernel void test_dct(__global float *gdata, __global float *gres)
{
    uint gid = get_global_id(0);
    uint idx = gid * 128;
    float data[128];   /* private arrays: with the loops unrolled, */
    float res[128];    /* the compiler can keep them in registers  */

    #pragma unroll
    for (uint i = 0; i < 128; i++) {
        data[i] = gdata[idx + i];
    }
    //for (uint i = 5; i <= 128; i++) {
    dct_ii(data, res);
    //}
    #pragma unroll
    for (uint i = 0; i < 128; i++) {
        gres[idx + i] = res[i];
    }
}
```

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/beignet/beignet/issues/41.
Created attachment 118911 [details]
clinfo

Beignet GPU vs Intel® OpenCL CPU vs POCL, secs: 1.351889 vs 1.073667 vs 7.501667

Intel® software is opencl-1.2-5.0.0.57 (CPU only). Beignet is master at 00e207e2a4e428ee1158159ad50c8188f9fe23f0 with llvm-3.7.

My task is a naive, unoptimized DCT over a bunch of float arrays of size 128. The Intel® variant is faster even at a global size of 65536; it seems to apply some obvious optimizations to the DCT algorithm, which, as we know, is very optimizable. Is it possible to pass some optimization parameters to LLVM to get faster code? Or is Intel®'s black magic impossible to replicate?
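On the question of passing optimization parameters: OpenCL exposes standard compiler flags through the options string of clBuildProgram, which a conformant implementation (Beignet included) must accept; whether they close the gap to Intel's implementation here would have to be measured. A minimal sketch, assuming an already-created context and device:

```c
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: build `src` with standard OpenCL optimization flags. The flags
 * come from the OpenCL 1.2 specification; how much they help on any
 * particular implementation has to be measured, not assumed. */
cl_program build_fast(cl_context ctx, cl_device_id dev, const char *src)
{
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);

    /* -cl-fast-relaxed-math relaxes IEEE semantics (similar in spirit to
     * the native_cos suggestion above); -cl-mad-enable permits a*b+c
     * contraction into a single mad. */
    err = clBuildProgram(prog, 1, &dev,
                         "-cl-fast-relaxed-math -cl-mad-enable", NULL, NULL);
    if (err != CL_SUCCESS) {
        size_t len = 0;   /* on failure, fetch and print the build log */
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
        char *log = malloc(len);
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, len, log, NULL);
        fprintf(stderr, "build failed:\n%s\n", log);
        free(log);
    }
    return prog;
}
```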