bilateral denoise performance refinement

Process 4 pixels at a time and remove the private array.
The performance could reach 40 on GT2.
2 files changed
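
For illustration only, here is a minimal C sketch of the idea behind the change, not the actual kernel from this commit. It assumes a single-channel float image, a 3x3 window, a width that is a multiple of 4, and a simple rational falloff in place of the usual Gaussian weights. All names (bilateral_scalar, bilateral_x4, falloff, RADIUS) are hypothetical. The first function stages every tap in per-pixel private arrays; the second handles four adjacent pixels per iteration and keeps only running accumulators, with the 4-element arrays standing in for a 4-wide vector register.

```c
#define RADIUS 1 /* 3x3 window, chosen only to keep the sketch small */
#define TAPS ((2 * RADIUS + 1) * (2 * RADIUS + 1))

static int clampi(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Rational falloff standing in for the usual Gaussian kernel (assumption). */
static float falloff(float dist2, float sigma)
{
    return 1.0f / (1.0f + dist2 / (sigma * sigma));
}

/* Baseline: one pixel per iteration, all taps staged in private arrays. */
void bilateral_scalar(const float *src, float *dst, int w, int h,
                      float sigma_s, float sigma_r)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            float wgt[TAPS], val[TAPS];
            float c = src[y * w + x];
            int n = 0;
            for (int dy = -RADIUS; dy <= RADIUS; dy++) {
                for (int dx = -RADIUS; dx <= RADIUS; dx++) {
                    int sx = clampi(x + dx, 0, w - 1);
                    int sy = clampi(y + dy, 0, h - 1);
                    float v = src[sy * w + sx];
                    float ws = falloff((float)(dx * dx + dy * dy), sigma_s);
                    float wr = falloff((v - c) * (v - c), sigma_r);
                    wgt[n] = ws * wr;
                    val[n] = v;
                    n++;
                }
            }
            float sum = 0.0f, norm = 0.0f;
            for (int i = 0; i < n; i++) {
                sum += wgt[i] * val[i];
                norm += wgt[i];
            }
            dst[y * w + x] = sum / norm;
        }
    }
}

/* Refined: four adjacent pixels per iteration, running accumulators only.
 * Assumes w is a multiple of 4. */
void bilateral_x4(const float *src, float *dst, int w, int h,
                  float sigma_s, float sigma_r)
{
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x += 4) {
            float c[4], sum[4] = {0.0f}, norm[4] = {0.0f};
            for (int k = 0; k < 4; k++)
                c[k] = src[y * w + x + k];
            for (int dy = -RADIUS; dy <= RADIUS; dy++) {
                int sy = clampi(y + dy, 0, h - 1);
                for (int dx = -RADIUS; dx <= RADIUS; dx++) {
                    /* The spatial weight is shared by all four lanes. */
                    float ws = falloff((float)(dx * dx + dy * dy), sigma_s);
                    for (int k = 0; k < 4; k++) {
                        int sx = clampi(x + k + dx, 0, w - 1);
                        float v = src[sy * w + sx];
                        float wr = falloff((v - c[k]) * (v - c[k]), sigma_r);
                        float wgt = ws * wr;
                        sum[k] += wgt * v;
                        norm[k] += wgt;
                    }
                }
            }
            for (int k = 0; k < 4; k++)
                dst[y * w + x + k] = sum[k] / norm[k];
        }
    }
}
```

Both functions should produce the same output up to floating-point rounding. In this sketch the 4-wide version computes the spatial weight once per tap for all four pixels and never stores the per-tap weights.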