Homework 7 关于Canny的SIMD优化练习 LeeRinji

题目链接

题目简述

针对经典的边缘检测Canny算子,使用串行代码按四个步骤实现其基本功能,再应用SIMD优化串行实现(可使用Intel IPP库),并且分析优化的思路和流程,最终给出实验结果(使用图表总结),考虑优化前后边缘检测算法性能和运行效率有哪些变化。

详细说明

Canny算法

Canny是最早由John F. Canny在1986年提出的边缘检测算法,并沿用至今。

Canny, John. “A computational approach to edge detection.” Readings in Computer Vision. 1987. 184-203.

John F. Canny给出了评价边缘检测性能优劣的三个指标:

  1. Good detection。即要使得标记真正边缘点的失误率和标记非边缘点的错误率尽量低。
  2. Good localization。即检测出的边缘点要尽可能在实际边缘的中心;
  3. Only one response to a single edge。当同一条边有多个响应时,仅能取其一作为标记。即数学上单个边缘产生多个响应的概率越低越好,并且尽量抑制图像噪声产生虚假边缘。

Canny算法是以上述三个指标为优化目标的求解问题的最优解,即在对图像中物体边缘敏感性的同时,也抑制或消除噪声的影响。其主要步骤如下:

  1. Noise Reduction(可使用高斯滤波器去噪)
  2. Finding Intensity Gradient of the Image(可在横纵轴分别用Sobal算子初步计算出两张梯度图,再最终计算出原图梯度的幅值和方向,其中方向最终近似到四个角度0, 45, 90, 135)
  3. Non-maximum Suppression(边缘细化,使其更清晰)
  4. Hysteresis Thresholding(最终使用双阈值来选择边缘像素,生成边缘检测结果)

SIMD

SIMD全称Single Instruction Multiple Data,单指令多数据流,它已经成为Intel处理器的重要性能扩展。目前Intel处理器支持的SIMD技术包括MMX,SS,,AVX,AVX256,AVX512等。

MMX提供了8个64bit的寄存器进行SIMD操作,SSE系列提供了128bit的8个寄存器进行SIMD指令操作,而AVX指令则支持256/512bit的SIMD操作。

目前SIMD指令可以有多种方法进行使用,如下图所示,包括使用编译器的自动向量化(Auto-vectorization)支持、使用编译器指示符(compiler directive)、使用编译器的内建函数(intrinsic)和直接使用汇编语言编写汇编函数再从C++代码中调用汇编函数。

参考资料

实验环境

软件

大部分开发环境安装在WSL上,较之于双系统、虚拟机等其他开发方案,更加方便,也方便直接使用Linux下的一些指令。

硬件

所用机器型号为VAIO Z Flip 2016。

实验过程

编译代码

$ gcc -o canny_edge canny_edge.c hysteresis.c pgm_io.c -lm -fopenmp -fopt-info -O3
canny_edge.c:439:3: note: loop vectorized
canny_edge.c:422:3: note: loop vectorized
canny_edge.c:422:3: note: loop versioned for vectorization because of possible aliasing
canny_edge.c:439:3: note: loop turned into
non-loop; it never loops.
canny_edge.c:439:3: note: loop with 7 iterations completely unrolled
canny_edge.c:422:3: note: loop turned into
non-loop; it never loops.
canny_edge.c:422:3: note: loop with 14 iterations completely unrolled
canny_edge.c:392:6: note: loop turned into
non-loop; it never loops.
canny_edge.c:392:6: note: loop with 7 iterations completely unrolled
canny_edge.c:560:2: note: loop vectorized
canny_edge.c:560:2: note: loop turned into
non-loop; it never loops.
canny_edge.c:560:2: note: loop with 6 iterations completely unrolled
canny_edge.c:536:6: note: loop turned into
non-loop; it never loops.
canny_edge.c:536:6: note: loop with 3 iterations completely unrolled
hysteresis.c:28:2: note: loop turned into non-loop; it never loops.
hysteresis.c:28:2: note: loop with 9 iterations completely unrolled
hysteresis.c:31:48: note: basic block vectorized
hysteresis.c:28:54: note: basic block vectorized
hysteresis.c:28:2: note: loop turned into non-loop; it never loops.
hysteresis.c:28:2: note: loop with 9 iterations completely unrolled
hysteresis.c:98:9: note: Loop 2 distributed: split to 1 loops and 1 library calls.
hysteresis.c:89:11: note: Loop 8 distributed: split to 0 loops and 1 library calls.
hysteresis.c:98:9: note: loop vectorized
hysteresis.c:78:7: note: loop vectorized
hysteresis.c:71:7: note: loop vectorized
hysteresis.c:63:10: note: loop vectorized
hysteresis.c:48:6: note: loop turned into non-loop; it never loops
hysteresis.c:48:6: note: loop turned into non-loop; it never loops.
hysteresis.c:48:6: note: loop with 2 iterations completely unrolled
hysteresis.c:48:6: note: loop turned into non-loop; it never loops
hysteresis.c:48:6: note: loop turned into non-loop; it never loops.
hysteresis.c:48:6: note: loop with 15 iterations completely unrolled
hysteresis.c:48:6: note: loop turned into non-loop; it never loops.
hysteresis.c:48:6: note: loop with 15 iterations completely unrolled
hysteresis.c:48:6: note: loop turned into non-loop; it never loops.
hysteresis.c:48:6: note: loop with 15 iterations completely unrolled
hysteresis.c:28:54: note: basic block vectorized
hysteresis.c:28:54: note: basic block vectorized
hysteresis.c:175:49: note: loop vectorized
hysteresis.c:172:46: note: loop vectorized
hysteresis.c:165:6: note: loop turned into
non-loop; it never loops.
hysteresis.c:165:6: note: loop with 15 iterations completely unrolled
hysteresis.c:165:6: note: loop turned into
non-loop; it never loops.
hysteresis.c:165:6: note: loop with 15 iterations completely unrolled

稍微解释一下某些编译参数。

将输入图片转码成pgm格式

由于实现的算法只支持pgm格式,需要先将输入文件转码:

ffmpeg -i MizunoAi.jpg MizunoAi.pgm

由于老师给的图片尺寸不够大,在我的机器上很难明显显示出并行化优化后加速的效果,这里我使用waifu2x算法生成了一张12000*6748的图片作为测试。当然使用老师提供的图片也是可以正常运行的,只是优化的效果就不太明显了。

测试运行时间

使用time指令来测试运行时间,以下测试时间均为多次测试后得到的稳定时间。

根据原作者写的README和我自己的调参,发现当运行参数设置为2.4 0.5 0.9时有不错的运行效果。

-O3优化

$ time ./canny_edge MizunoAi.pgm 2.4 0.5 0.9

real    0m8.387s
user    0m7.125s
sys     0m1.203s

-O2优化

$ time ./canny_edge MizunoAi.pgm 2.4 0.5 0.9
real    0m9.052s
user    0m7.719s
sys     0m1.281s

-O1优化

$ time ./canny_edge MizunoAi.pgm 2.4 0.5 0.9
real    0m9.640s
user    0m8.266s
sys     0m1.234s

无优化

$ time ./canny_edge MizunoAi.pgm 2.4 0.5 0.9
real    0m20.856s
user    0m19.469s
sys     0m1.250s

运行结果

将图片转化回png格式方便查看:

ffmpeg -i MizunoAi.pgm_s_2.40_l_0.50_h_0.90.pgm MizunoAi.png

下面对比算法的效果。

MizunoAi.jpg