Project for the Accelerated Computing Systems course of the Master's degree in Computer Engineering, University of Bologna. Parallel application for password cracking through brute-force attacks on SHA-256 hashes (including salted ones) and dictionary attacks, with a performance comparison between Sequential (CPU) and Parallel (GPU/CUDA) implementations.
The project implements a password cracker that supports different attack modes to reverse SHA-256 hashes. The main goal is to demonstrate the speedup achievable by moving from serial execution on CPU to massively parallel execution on GPU, analyzing different CUDA memory optimization strategies (Global vs Constant Memory) and the use of computational resources.
- Incremental Brute Force: Dynamic password generation given a charset and length range (min-max).
- Dictionary Attack: Support for external wordlists.
- Salt Support: Handling of salted hashes (Brute Force and dictionary attack).
- Multi-Platform: Native CUDA code for NVIDIA and a semi-automatic porting script for AMD HIP.
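The incremental brute force described above can be illustrated by mapping a global candidate index to a password string. The following is a minimal host-side C++ sketch, not the project's actual code: the function names `keyspace_size` and `index_to_candidate` are hypothetical, and the ordering (shorter lengths first, then base-N digits over the charset) is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Number of candidates with length in [min_len, max_len] over a charset of
// the given size: the sum of charset_size^len. Overflow is not checked.
std::uint64_t keyspace_size(std::uint64_t charset_size, int min_len, int max_len) {
    std::uint64_t total = 0;
    std::uint64_t power = 1;  // charset_size^len, built incrementally
    for (int len = 1; len <= max_len; ++len) {
        power *= charset_size;
        if (len >= min_len) total += power;
    }
    return total;
}

// Map a global index to a candidate: skip over the shorter lengths, then
// write the remaining index in base charset.size().
// The index is expected to be smaller than keyspace_size(...).
std::string index_to_candidate(std::uint64_t index, const std::string& charset,
                               int min_len, int max_len) {
    assert(!charset.empty() && min_len >= 1 && min_len <= max_len);
    const std::uint64_t base = charset.size();
    std::uint64_t count = 1;
    for (int i = 0; i < min_len; ++i) count *= base;  // base^min_len
    int len = min_len;
    while (len < max_len && index >= count) {
        index -= count;  // skip all candidates of the current length
        count *= base;
        ++len;
    }
    std::string candidate(len, charset[0]);
    for (int pos = len - 1; pos >= 0; --pos) {
        candidate[pos] = charset[index % base];
        index /= base;
    }
    return candidate;
}
```

With a mapping of this kind, each worker (a loop iteration on CPU, or a thread on GPU) can build its candidate from its index alone, without depending on the candidates generated by other workers.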
- ASSETS/: contains files used for cracking (charset and dictionary).
- Sequenziale/: Sequential reference implementation (uses OpenSSL).
- CUDA_NAIVE/: First GPU implementation (global memory usage).
- CUDAv1/: Memory optimization (Constant Memory usage for charset and target).
- CUDAv2/: Kernel optimization (loop unrolling, register optimization for SHA-256).
- UTILS/: Support functions (file I/O, argument parsing).
- SHA256_CUDA/: CUDA implementation for SHA256, based on mochimodev's implementation.
- SHA256_CUDA_OPT/: Optimized CUDA implementation for SHA256 (used by CUDAv2).
- ESTENSIONE/: contains the implementation of the project extension, i.e., dictionary attack and hash cracking with salt (called from kernel_estensione.cu).
- kernel_[project_version].cu: file to run the corresponding version.

All CUDA[project_version] versions (executed by their respective kernel files) depend on the UTILS and SHA256_CUDA files, except for CUDAv2 and the extension, which use SHA256_CUDA_OPT instead of SHA256_CUDA.
Note: the dictionary used is a version of rockyou.txt trimmed to passwords of length 64. Our version is available in the ASSETS folder and must be unzipped before use (it is compressed due to GitHub limits).
- Hardware:
  - NVIDIA GPU (Compute Capability 5.0+)
- Software:
  - NVIDIA CUDA Toolkit (11.0+)
  - OpenSSL (for CPU implementation)
  - C++ Compiler (MSVC on Windows, GCC/Clang on Linux)
Make sure OpenSSL libraries are linked correctly.
nvcc -arch=sm_89 -rdc=true -O3 \
kernel_naive.cu \
CUDA_NAIVE/*.cu \
SHA256_CUDA/*.cu \
UTILS/*.cu UTILS/*.cpp \
-o naive_cuda \
-lssl -lcrypto -lcudadevrt -I.
(Change file names and dependencies based on the version to compile.)
The program accepts command line parameters for maximum flexibility:
./brute_force_cuda [<blockSize>] <hash_target> <min_len> <max_len> <file_charset> [<salt> <dictionary-yes/no> <dictionary_file>]
The blockSize argument must always be passed to the parallel GPU scripts (both CUDA and HIP), and only to them.
The dictionary arguments (flag and file path) and the salt must be passed only to the extension scripts.
Note: in the extension version max_len includes the salt length.
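The note above implies that a salt reduces the room left for the password itself. The sketch below shows that arithmetic in C++; the helper names are hypothetical, and appending the salt after the password is an assumption made only for illustration (the project may combine them differently).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Longest password worth generating when max_len also counts the salt.
// Returns 0 if the salt alone already reaches max_len.
std::size_t max_password_len(std::size_t max_len, const std::string& salt) {
    return salt.size() >= max_len ? 0 : max_len - salt.size();
}

// Build the message to hash. Appending the salt is an assumption.
std::string salted_message(const std::string& password, const std::string& salt,
                           std::size_t max_len) {
    assert(password.size() + salt.size() <= max_len);
    return password + salt;
}
```

For instance, with a two-character salt and max_len set to 6, this sketch leaves at most four characters for the generated password.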
Example:
Search for the password of the hash (corresponding to "qwerty") with length 6, using the standard charset:
./brute_force_cuda 256 qwerty 1 6 ASSETS/CharSet.txt az No
Tests were conducted on:
- sequential: Ryzen 9 9900X
- CUDA: NVIDIA RTX 4060 Laptop and partially Google Colab
The SHA-256 algorithm is heavily compute-bound. The v2 implementation makes heavy use of registers to keep the hash state and avoid local/global memory latencies. Although the high number of registers (118 per thread) limits the number of active warps (low occupancy), single-thread execution speed increases drastically. In this scenario, maximizing IPC (Instructions Per Cycle) proved more effective than maximizing latency hiding through a larger number of resident warps (occupancy).
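The effect of register usage on occupancy can be estimated with simple arithmetic. The C++ sketch below is a simplified model, not a replacement for the profiler: the `SmLimits` values used in the note (65536 registers and 48 resident warps per SM, registers rounded up to a multiple of 8 per thread) are assumptions for a compute capability 8.9 GPU, and limits from shared memory, blocks per SM, and warp allocation granularity are ignored.

```cpp
#include <cassert>

// Per-SM hardware limits. All values passed in are assumptions by the caller.
struct SmLimits {
    int registers_per_sm;  // size of the register file
    int max_warps_per_sm;  // hardware cap on resident warps
    int reg_alloc_unit;    // per-thread register count is rounded up to this
};

int round_up(int value, int multiple) {
    return (value + multiple - 1) / multiple * multiple;
}

// Resident warps allowed by the register file alone.
int register_limited_warps(int regs_per_thread, const SmLimits& sm) {
    assert(regs_per_thread > 0 && sm.reg_alloc_unit > 0);
    const int regs_per_warp = round_up(regs_per_thread, sm.reg_alloc_unit) * 32;
    const int warps = sm.registers_per_sm / regs_per_warp;
    return warps < sm.max_warps_per_sm ? warps : sm.max_warps_per_sm;
}

// Occupancy estimate: whole blocks that fit, expressed as a fraction of
// the maximum number of resident warps.
double occupancy(int regs_per_thread, int block_size, const SmLimits& sm) {
    const int warps_per_block = (block_size + 31) / 32;
    const int blocks = register_limited_warps(regs_per_thread, sm) / warps_per_block;
    return static_cast<double>(blocks * warps_per_block) / sm.max_warps_per_sm;
}
```

Under these assumed limits, 118 registers per thread allow 17 register-limited warps, which gives an estimated occupancy of about one third for block sizes of 64, 128, and 256 threads.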
Furthermore, the use of smaller block sizes (64/128 threads) led to better performance compared to the classic 256, thanks to better management of the Tail Effect (wave quantization) and lower scheduling overhead.
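Wave quantization itself can be expressed with a small calculation. The C++ sketch below uses hypothetical helper names and a deliberately simple model in which every block takes the same time; `concurrent_blocks` (the number of blocks the whole GPU can hold at once) is an input the caller must supply.

```cpp
#include <cassert>
#include <cstdint>

// Number of waves needed to run total_blocks when the GPU can hold
// concurrent_blocks at the same time.
std::uint64_t wave_count(std::uint64_t total_blocks, std::uint64_t concurrent_blocks) {
    assert(concurrent_blocks > 0);
    return (total_blocks + concurrent_blocks - 1) / concurrent_blocks;
}

// Fraction of block slots doing useful work across all waves.
// 1.0 means the last wave is completely full (no tail).
double wave_efficiency(std::uint64_t total_blocks, std::uint64_t concurrent_blocks) {
    const std::uint64_t waves = wave_count(total_blocks, concurrent_blocks);
    if (waves == 0) return 1.0;  // nothing to run
    return static_cast<double>(total_blocks) /
           static_cast<double>(waves * concurrent_blocks);
}
```

For example, 100 blocks on a GPU holding 48 at a time need 3 waves, the last of which is mostly empty. The model does not capture the fact that smaller blocks free their slots in finer increments, so it illustrates the tail effect rather than reproducing the measured results.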
The extension implementation has essentially the same performance as v2 (since it uses practically the same code), with the difference that, for a dictionary attack, the time to a hit is certainly lower than the time needed to test all combinations.
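The shorter time on a hit comes from stopping at the first matching word. A minimal sequential C++ sketch of that early exit is shown below; `find_in_dictionary` is a hypothetical name, and the digest function is passed in as a parameter so the example does not depend on any particular SHA-256 library. A GPU version would additionally need some way for threads to signal that a match was found, which is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Returns the index of the first word whose digest equals target_digest,
// or -1 if no word matches. The search stops at the first hit.
long find_in_dictionary(const std::vector<std::string>& words,
                        const std::string& target_digest,
                        const std::function<std::string(const std::string&)>& digest) {
    assert(digest);  // a callable digest function is required
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (digest(words[i]) == target_digest) {
            return static_cast<long>(i);  // early exit: remaining words are skipped
        }
    }
    return -1;
}
```

In real use, `digest` would wrap the SHA-256 routine (with the salt applied, if any) and `words` would be loaded from the wordlist file.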
This project is distributed under the AGPL license. See the LICENSE file for details.
