| Summary: | Performance compared to Intel® proprietary implementation | | |
|---|---|---|---|
| Product: | Beignet | Reporter: | ilia <inferrna> |
| Component: | Beignet | Assignee: | rongyang <rong.r.yang> |
| Status: | RESOLVED MOVED | QA Contact: | |
| Severity: | minor | | |
| Priority: | medium | | |
| Version: | unspecified | | |
| Hardware: | x86-64 (AMD64) | | |
| OS: | All | | |
| Whiteboard: | | | |
| i915 platform: | | i915 features: | |
| Attachments: | clinfo, Source code | | |
Created attachment 118912 [details]
Source code
Compile and run it once for each of the 3 available platforms:

```sh
gcc dcttest.c -O3 -lOpenCL -lm -o dcttest
for i in 0 1 2; do ./dcttest $i; done
```
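(The attached dcttest.c is not inlined in this report. Purely as an illustration of the invocation above, a harness of this shape would select the OpenCL platform by the command-line index; every name and detail below is an assumption, not the attached code.)

```c
/* Hypothetical sketch of a dcttest-style harness; all names and structure
 * are assumptions, since the real attachment is not inlined above. */
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    cl_uint nplat = 0;
    clGetPlatformIDs(0, NULL, &nplat);              /* count platforms */
    cl_platform_id *plats = malloc(nplat * sizeof *plats);
    clGetPlatformIDs(nplat, plats, NULL);

    cl_uint idx = (argc > 1) ? (cl_uint)atoi(argv[1]) : 0;
    if (idx >= nplat) {
        fprintf(stderr, "no platform %u (found %u)\n", idx, nplat);
        return 1;
    }

    char name[256];
    clGetPlatformInfo(plats[idx], CL_PLATFORM_NAME, sizeof name, name, NULL);
    printf("Running on platform %u: %s\n", idx, name);

    /* ... create a context and queue on plats[idx], build and enqueue the
     * DCT kernel, and time the run, e.g. with clock_gettime() ... */
    free(plats);
    return 0;
}
```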
I took a quick look at your kernel; there are two improvement points:

1. Use native_cos instead of cos. It loses some precision, so if your program is precision-sensitive you can't use this method.
2. Add some #pragma unroll hints. data[128] and res[128] are private arrays, and Beignet stores them in global memory. Because the accesses to the data array are all at constant indices once the loop is unrolled, the unroll hint lets the compiler promote the array to registers, which can improve performance significantly.

For more optimization tips, please refer to http://www.freedesktop.org/wiki/Software/Beignet/optimization-guide/.

```c
#ifndef INFINITY
#define INFINITY 1.0/0
#endif
#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* Naive DCT-II over a 128-element array. */
void dct_ii(float *x, float *X)
{
    float sum = 0.;
    #pragma unroll
    for (int n = 0; n < 128; ++n) {
        sum += x[n];
    }
    X[0] = sum;
    for (uint k = 1; k < 128; ++k) {
        sum = 0.;
        #pragma unroll
        for (int n = 0; n < 128; ++n) {
            sum += x[n] * native_cos((float)(M_PI * (n + .5) * k / 128));
        }
        X[k] = sum;
    }
}

__kernel void test_dct(__global float *gdata, __global float *gres)
{
    uint gid = get_global_id(0);
    uint idx = gid * 128;
    float data[128];   /* private arrays: with the loops unrolled, */
    float res[128];    /* the compiler can keep them in registers  */

    #pragma unroll
    for (uint i = 0; i < 128; i++) {
        data[i] = gdata[idx + i];
    }
    //for (uint i = 5; i <= 128; i++) {
    dct_ii(data, res);
    //}
    #pragma unroll
    for (uint i = 0; i < 128; i++) {
        gres[idx + i] = res[i];
    }
}
```

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/beignet/beignet/issues/41.
Created attachment 118911 [details]
clinfo

Beignet GPU vs Intel® OpenCL CPU vs POCL, secs: 1.351889 vs 1.073667 vs 7.501667

Intel® software is opencl-1.2-5.0.0.57 (CPU only). Beignet is master at 00e207e2a4e428ee1158159ad50c8188f9fe23f0 with llvm-3.7.

My task is a naive, unoptimized DCT over a bunch of float arrays of size 128. The Intel® variant is faster even at a global size of 65536; it seems to apply some obvious optimizations to the DCT algorithm, which, as we know, is very optimizable. Is it possible to pass some optimization parameters to LLVM to get faster code? Or is Intel®'s black magic impossible to replicate?
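On the question of passing optimization parameters: OpenCL exposes standard compiler flags through the options string of clBuildProgram, which a conformant implementation (Beignet included) must accept; whether they close the gap to Intel's implementation here would have to be measured. A minimal sketch, assuming an already-created context and device:

```c
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

/* Sketch: build `src` with standard OpenCL optimization flags. The flags
 * come from the OpenCL 1.2 specification; how much they help on any
 * particular implementation has to be measured, not assumed. */
cl_program build_fast(cl_context ctx, cl_device_id dev, const char *src)
{
    cl_int err;
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);

    /* -cl-fast-relaxed-math relaxes IEEE semantics (similar in spirit to
     * the native_cos suggestion above); -cl-mad-enable permits a*b+c
     * contraction into a single mad. */
    err = clBuildProgram(prog, 1, &dev,
                         "-cl-fast-relaxed-math -cl-mad-enable", NULL, NULL);
    if (err != CL_SUCCESS) {
        size_t len = 0;   /* on failure, fetch and print the build log */
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, 0, NULL, &len);
        char *log = malloc(len);
        clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG, len, log, NULL);
        fprintf(stderr, "build failed:\n%s\n", log);
        free(log);
    }
    return prog;
}
```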