Created attachment 112474
Running on "Intel(R) HD Graphics Haswell GT2 Mobile" (Gen7.5). Tested with Release_v1.0.0 and with latest master (786da41).
The attached patch adds a test which essentially computes the dot-product of two 16-element arrays, slightly unrolled so it does two multiplications per loop iteration. The inputs are 16-bit and the sum is 64-bit.
Every work item does exactly the same computation. It runs with global and local work size 16.
The output shows the first 8 work items get the correct result, but the next 8 get the wrong result.
If I un-unroll the loop (change "i += 2" to "i += 1" and remove the "sum += b0 * b1"), then it gives the correct output.
If I change "long sum = 0" to "int sum = 0", then it gives the correct output.
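The attached kernel is not reproduced in this thread. As a rough CPU-side reference for what the test is described as computing, here is a minimal C sketch. It assumes signed 16-bit inputs. The function names, and the reading of a0/a1 as element i and b0/b1 as element i+1, are guesses from the description above and are not taken from the attachment.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define N 16 /* array length and global/local work size from the report */

/* Dot product of two arrays of 16-bit values, accumulated in 64 bits.
 * The loop is unrolled by two, so n must be even. */
static int64_t dot16_unrolled(const int16_t *x, const int16_t *y, size_t n)
{
    assert(n % 2 == 0);
    int64_t sum = 0;
    for (size_t i = 0; i < n; i += 2) {
        /* Widen to 64 bits before multiplying (assumed variable roles). */
        int64_t a0 = x[i],     a1 = y[i];
        int64_t b0 = x[i + 1], b1 = y[i + 1];
        sum += a0 * a1;
        sum += b0 * b1;
    }
    return sum;
}

/* Every work item performs the same computation, so a correct run
 * fills all N output slots with the same value. */
static void run_all_work_items(const int16_t *x, const int16_t *y,
                               int64_t out[N])
{
    for (size_t gid = 0; gid < N; ++gid)
        out[gid] = dot16_unrolled(x, y, N);
}
```

Comparing the device output against out[] from run_all_work_items would show which work items disagree with the reference; in the failing run described above, those would be work items 8 to 15.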
This looks like a bug in the post-register-allocation instruction scheduling. You can disable that scheduling pass by setting the following environment variable:
# export OCL_POST_ALLOC_INSN_SCHEDULE=0
Then try again. Note that this will cause a performance regression of about 8%.
I will fix it soon. Thanks for reporting this.
I just submitted a patch to the mailing list. The patch is at:
Could you try it at your side?
That seems to fix it for me - thanks!
The following patch has been pushed to the master and Release_v1.0 branches.
Author: Zhigang Gong <firstname.lastname@example.org>
Date: Tue Jan 20 14:40:39 2015 +0800
GBE: fix an ACC register related instruction scheduling bug
Some instructions modify the ACC register in the gen_context
stage, which is not recognized by the current instruction scheduling
algorithm. This patch fixes this bug by checking all the possible
SEL_OPs which may change the ACC implicitly.
The corresponding bugzilla link is as below:
Signed-off-by: Zhigang Gong <email@example.com>
Reviewed-by: "Yang, Rong R" <firstname.lastname@example.org>