bilateral performance tuning: SLM method

Use SLM: load the input into shared local memory and add a barrier, so that
the work-items in a work-group share the read bandwidth.
Also use #pragma unroll to avoid unrolling the loop by hand.
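
For illustration only, the pattern described above might look like the following OpenCL C sketch. It is not the kernel from this commit: the kernel name, the buffer-based arguments, the 16x16 work-group size, the radius of 2 and the two precomputed sigma factors are all assumptions made for the example.

```c
// Hypothetical sketch, not the committed kernel. TILE, RADIUS, the argument
// list and the use of plain float buffers are illustrative assumptions.
#define TILE   16                      // work-group is TILE x TILE (assumed)
#define RADIUS 2                       // filter radius (assumed)
#define SLM_W  (TILE + 2 * RADIUS)     // tile width including the halo

__kernel __attribute__((reqd_work_group_size(TILE, TILE, 1)))
void bilateral_slm(__global const float *src, __global float *dst,
                   int width, int height,
                   float inv_2sigma_s2,   // 1 / (2 * sigma_spatial^2)
                   float inv_2sigma_r2)   // 1 / (2 * sigma_range^2)
{
    __local float tile[SLM_W][SLM_W];

    const int lx = (int)get_local_id(0),  ly = (int)get_local_id(1);
    const int gx = (int)get_global_id(0), gy = (int)get_global_id(1);

    // Top-left corner of the tile, halo included, in image coordinates.
    const int x0 = (int)get_group_id(0) * TILE - RADIUS;
    const int y0 = (int)get_group_id(1) * TILE - RADIUS;

    // Cooperative load: the work-group reads each tile pixel from global
    // memory once; coordinates are clamped at the image border.
    for (int i = ly * TILE + lx; i < SLM_W * SLM_W; i += TILE * TILE) {
        const int tx = i % SLM_W, ty = i / SLM_W;
        const int sx = clamp(x0 + tx, 0, width - 1);
        const int sy = clamp(y0 + ty, 0, height - 1);
        tile[ty][tx] = src[sy * width + sx];
    }

    // Every load must be visible before any work-item reads its neighbours.
    barrier(CLK_LOCAL_MEM_FENCE);

    if (gx >= width || gy >= height)
        return;

    const float center = tile[ly + RADIUS][lx + RADIUS];
    float sum = 0.0f, norm = 0.0f;

    // Fixed trip counts let the compiler unroll both loops.
    #pragma unroll
    for (int dy = -RADIUS; dy <= RADIUS; dy++) {
        #pragma unroll
        for (int dx = -RADIUS; dx <= RADIUS; dx++) {
            const float v = tile[ly + RADIUS + dy][lx + RADIUS + dx];
            const float d = v - center;
            const float w = exp(-(float)(dx * dx + dy * dy) * inv_2sigma_s2
                                - d * d * inv_2sigma_r2);
            sum  += w * v;
            norm += w;
        }
    }
    dst[gy * width + gx] = sum / norm;
}
```

In this sketch each neighbourhood read after the barrier comes from local memory rather than global memory, which is the bandwidth sharing the message refers to. #pragma unroll is a compiler hint; whether the loops are actually unrolled depends on the OpenCL compiler in use.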

Performance now reaches 68 fps on Ivy Bridge GT2, which has 24 EUs, and holds at 13 fps on Bay Trail-I.

Signed-off-by: Juan Zhao <juan.j.zhao@intel.com>
2 files changed