This project benchmarks three optimized implementations of the Floyd-Warshall all-pairs shortest-paths algorithm on an 8-core CPU: a naive serial version, a cache-friendly tiled version, and an OpenMP-parallelized version with cache-friendly tiling. The naive serial version serves as the baseline for all comparisons, and my CPU optimizations achieved over a 6x speedup on 8 cores. I also implemented a CUDA-accelerated GPU version of the algorithm that uses GPU global memory and massive block- and thread-level parallelism, achieving a 20x speedup over the naive serial implementation. All benchmarks were run across many graph sizes, and every result was verified against the naive implementation to ensure correctness. For more information on methods, results, and background, read my paper by clicking this LINK.
- Serial implementation for baseline benchmarking and results verification
- Cache-friendly blocked/tiled implementation for improved cache locality and performance on CPU
- Blocked implementation parallelized with OpenMP for multi-core CPU acceleration along with improved cache locality (6x speedup)
- CUDA kernel with massive block- and thread-level parallelism for GPU-accelerated performance (20x speedup)
- Automatic timing across 9 graph sizes (4×4 to 3072×3072)
- Output includes execution times printed in CSV format and the maximum error relative to the ground truth for result verification
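For reference, the naive serial baseline is the textbook triple loop: for each intermediate vertex `k`, relax every `(i, j)` pair. The sketch below is a minimal illustration of that baseline; the function name and flat row-major layout are my assumptions, not necessarily the repository's code.

```c
/* Naive serial Floyd-Warshall baseline (illustrative sketch).
   dist is an n*n row-major matrix of edge weights, with a large
   value for "no edge" and 0 on the diagonal. */
void floyd_warshall_serial(float *dist, int n)
{
    for (int k = 0; k < n; k++)           /* intermediate vertex */
        for (int i = 0; i < n; i++)       /* source */
            for (int j = 0; j < n; j++) { /* destination */
                float via_k = dist[i * n + k] + dist[k * n + j];
                if (via_k < dist[i * n + j])
                    dist[i * n + j] = via_k;
            }
}
```

Because every iteration touches full rows of the matrix, this version has poor cache reuse at large `n`, which is what the tiled variants address.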
```sh
gcc -O1 -fopenmp parallel_fw.c -lrt -o parallel_fw
OMP_NUM_THREADS=4 ./parallel_fw
```
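The OpenMP variant combines tiling with thread parallelism. A simplified sketch of the idea is below: for a fixed `k`, different rows can be updated independently, so tiles are distributed across threads. This is my illustrative simplification; the repository's blocked implementation (and its tile size) may differ, e.g. by also blocking the `k` loop.

```c
#define TILE 64 /* illustrative tile size, not the repo's value */

/* Tiled Floyd-Warshall with OpenMP (sketch). For each fixed k,
   row tiles are independent, so the ii loop is parallelized. */
void floyd_warshall_tiled_omp(float *dist, int n)
{
    for (int k = 0; k < n; k++) {
        #pragma omp parallel for schedule(static)
        for (int ii = 0; ii < n; ii += TILE)
            for (int jj = 0; jj < n; jj += TILE)
                for (int i = ii; i < ii + TILE && i < n; i++) {
                    float dik = dist[i * n + k]; /* reused across the j tile */
                    for (int j = jj; j < jj + TILE && j < n; j++) {
                        float via_k = dik + dist[k * n + j];
                        if (via_k < dist[i * n + j])
                            dist[i * n + j] = via_k;
                    }
                }
    }
}
```

The tiling keeps the working set of `dist[i*n+j]` and `dist[k*n+j]` small enough to stay in cache, while the `parallel for` spreads row tiles over the available cores (set via `OMP_NUM_THREADS`).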
```sh
# Use this for a K40m (change -arch accordingly for your GPU architecture)
nvcc -arch=sm_35 parallel_fw_cuda.cu -o parallel_fw_cuda
./parallel_fw_cuda
# Use nvidia-smi to find relevant information on your GPU's architecture
```
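The GPU version's structure can be sketched as one thread per `(i, j)` matrix entry, with one kernel launch per intermediate vertex `k`, operating directly on global memory. The names and launch configuration below are my assumptions for illustration, not the repository's exact kernel.

```cuda
// Illustrative CUDA kernel: each thread relaxes one (i, j) entry
// of the distance matrix in global memory for the current k.
__global__ void fw_kernel(float *dist, int n, int k)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && j < n) {
        float via_k = dist[i * n + k] + dist[k * n + j];
        if (via_k < dist[i * n + j])
            dist[i * n + j] = via_k;
    }
}

// Host side (sketch): d_dist is the n*n matrix already in device
// memory; the kernel is launched once per intermediate vertex k,
// and kernel-launch ordering serializes the k iterations.
void floyd_warshall_cuda(float *d_dist, int n)
{
    dim3 block(16, 16); // illustrative block shape
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);
    for (int k = 0; k < n; k++)
        fw_kernel<<<grid, block>>>(d_dist, n, k);
    cudaDeviceSynchronize();
}
```

The sequential `k` loop stays on the host because iteration `k` depends on the results of iteration `k-1`, while the `n * n` relaxations within each iteration are fully parallel across blocks and threads.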