针对经典的边缘检测Canny算子,使用串行代码按四个步骤实现其基本功能,再应用SIMD优化串行实现(可使用Intel IPP库),并且分析优化的思路和流程,最终给出实验结果(使用图表总结),考虑优化前后边缘检测算法性能和运行效率有哪些变化。
Canny是最早由John F. Canny在1986年提出的边缘检测算法,并沿用至今。
Canny, John. “A computational approach to edge detection.” Readings in Computer Vision. 1987. 184-203.
John F. Canny给出了评价边缘检测性能优劣的三个指标:
Canny算法是以上述三个指标为优化目标的求解问题的最优解,即在对图像中物体边缘敏感性的同时,也抑制或消除噪声的影响。其主要步骤如下:
SIMD全称Single Instruction Multiple Data,单指令多数据流,它已经成为Intel处理器的重要性能扩展。目前Intel处理器支持的SIMD技术包括MMX,SS,,AVX,AVX256,AVX512等。
MMX提供了8个64bit的寄存器进行SIMD操作,SSE系列提供了128bit的8个寄存器进行SIMD指令操作,而AVX指令则支持256/512bit的SIMD操作。
目前SIMD指令可以有多种方法进行使用,如下图所示,包括使用编译器的自动向量化(Auto-vectorization)支持、使用编译器指示符(compiler directive)、使用编译器的内建函数(intrinsic)和直接使用汇编语言编写汇编函数再从C++代码中调用汇编函数。
大部分开发环境安装在WSL上,较之于双系统、虚拟机等其他开发方案,更加方便,也方便直接使用Linux下的一些指令。
所用机器型号为VAIO Z Flip 2016。
$ gcc -o canny_edge canny_edge.c hysteresis.c pgm_io.c -lm -fopenmp -fopt-info -O3
canny_edge.c:439:3: note: loop vectorized
canny_edge.c:422:3: note: loop vectorized
canny_edge.c:422:3: note: loop versioned for vectorization because of possible aliasing
canny_edge.c:439:3: note: loop turned into
non-loop; it never loops.
canny_edge.c:439:3: note: loop with 7 iterations completely unrolled
canny_edge.c:422:3: note: loop turned into
non-loop; it never loops.
canny_edge.c:422:3: note: loop with 14 iterations completely unrolled
canny_edge.c:392:6: note: loop turned into
non-loop; it never loops.
canny_edge.c:392:6: note: loop with 7 iterations completely unrolled
canny_edge.c:560:2: note: loop vectorized
canny_edge.c:560:2: note: loop turned into
non-loop; it never loops.
canny_edge.c:560:2: note: loop with 6 iterations completely unrolled
canny_edge.c:536:6: note: loop turned into
non-loop; it never loops.
canny_edge.c:536:6: note: loop with 3 iterations completely unrolled
hysteresis.c:28:2: note: loop turned into non-loop; it never loops.
hysteresis.c:28:2: note: loop with 9 iterations completely unrolled
hysteresis.c:31:48: note: basic block vectorized
hysteresis.c:28:54: note: basic block vectorized
hysteresis.c:28:2: note: loop turned into non-loop; it never loops.
hysteresis.c:28:2: note: loop with 9 iterations completely unrolled
hysteresis.c:98:9: note: Loop 2 distributed: split to 1 loops and 1 library calls.
hysteresis.c:89:11: note: Loop 8 distributed: split to 0 loops and 1 library calls.
hysteresis.c:98:9: note: loop vectorized
hysteresis.c:78:7: note: loop vectorized
hysteresis.c:71:7: note: loop vectorized
hysteresis.c:63:10: note: loop vectorized
hysteresis.c:48:6: note: loop turned into non-loop; it never loops
hysteresis.c:48:6: note: loop turned into non-loop; it never loops.
hysteresis.c:48:6: note: loop with 2 iterations completely unrolled
hysteresis.c:48:6: note: loop turned into non-loop; it never loops
hysteresis.c:48:6: note: loop turned into non-loop; it never loops.
hysteresis.c:48:6: note: loop with 15 iterations completely unrolled
hysteresis.c:48:6: note: loop turned into non-loop; it never loops.
hysteresis.c:48:6: note: loop with 15 iterations completely unrolled
hysteresis.c:48:6: note: loop turned into non-loop; it never loops.
hysteresis.c:48:6: note: loop with 15 iterations completely unrolled
hysteresis.c:28:54: note: basic block vectorized
hysteresis.c:28:54: note: basic block vectorized
hysteresis.c:175:49: note: loop vectorized
hysteresis.c:172:46: note: loop vectorized
hysteresis.c:165:6: note: loop turned into
non-loop; it never loops.
hysteresis.c:165:6: note: loop with 15 iterations completely unrolled
hysteresis.c:165:6: note: loop turned into
non-loop; it never loops.
hysteresis.c:165:6: note: loop with 15 iterations completely unrolled
稍微解释一下某些编译参数。
-lm
,为正常使用sqrt
函数,需要链接到数学库。-fopenmp
,打开openmp
的支持,因为在这里我是使用编译制导#pragma omp simd
来将原来的算法simd
化的。-fopt-info
,显示被优化的部分。可以看到上面的输出中,很多循环和代码块被向量化了。-O3
,启用空间换速度的代码优化。之所以要开启O3
选项,是因为simd向量化通常是需要内存对齐的,因此会需要额外的空间。作为对比,关闭-O3
的时候没有被向量化(没有输出),而只开到-O2
的时候只有六个循环被向量化(-O3
会将某些内部循环展开,使得更多的可被向量化的语句被发现)。此外,还有一个优化级别最高的-Ofast
,经过测试,向量化语句的数量和-O3
一样是14个,而这一级别的优化却可能会使得算法的输出不符合预期,因此没有选用。由于实现的算法只支持pgm格式,需要先将输入文件转码:
ffmpeg -i MizunoAi.jpg MizunoAi.pgm
由于老师给的图片尺寸不够大,在我的机器上很难明显显示出并行化优化后加速的效果,这里我使用waifu2x算法生成了一张12000*6748
的图片作为测试。当然使用老师提供的图片也是可以正常运行的,只是优化的效果就不太明显了。
使用time
指令来测试运行时间,以下测试时间均为多次测试后得到的稳定时间。
根据原作者写的README和我自己的调参,发现当运行参数设置为2.4 0.5 0.9
时有不错的运行效果。
$ time ./canny_edge MizunoAi.pgm 2.4 0.5 0.9
real 0m8.387s
user 0m7.125s
sys 0m1.203s
$ time ./canny_edge MizunoAi.pgm 2.4 0.5 0.9
real 0m9.052s
user 0m7.719s
sys 0m1.281s
$ time ./canny_edge MizunoAi.pgm 2.4 0.5 0.9
real 0m9.640s
user 0m8.266s
sys 0m1.234s
$ time ./canny_edge MizunoAi.pgm 2.4 0.5 0.9
real 0m20.856s
user 0m19.469s
sys 0m1.250s
将图片转化回png格式方便查看:
ffmpeg -i MizunoAi.pgm_s_2.40_l_0.50_h_0.90.pgm MizunoAi.png
下面对比算法的效果。
MizunoAi.jpg |
---|