Skip to content

HashCrackerz/HashCracker-CUDA

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

81 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

logo

HashCracker (CUDA)

Parallel SHA-256 Brute Force & Dictionary (salted) Password Cracker

🇬🇧 English | 🇮🇹 Italiano

alt text

alt text

alt text

Project for the Accelerated Computing Systems course at the Master's degree in Computer Engineering, University of Bologna. Parallel application for password cracking through Brute Force attack on SHA-256 hashes (including salted) and dictionary attack, with performance comparison between Sequential (CPU) and Parallel (GPU/CUDA) implementations.

📝 Description

The project implements a password cracker that supports different attack modes to reverse SHA-256 hashes. The main goal is to demonstrate the speedup achievable by moving from serial execution on CPU to massively parallel execution on GPU, analyzing different CUDA memory optimization strategies (Global vs Constant Memory) and computational resources.

⚙️ Features

  • Incremental Brute Force: Dynamic password generation given a charset and length range (min-max).
  • Dictionary Attack: Support for external wordlists.
  • Salt Support: Handling of salted hashes (Brute Force and dictionary attack).
  • Multi-Platform: Native CUDA code for NVIDIA and (semi) automatic porting script for AMD HIP.

📂 Project Structure

  • ASSETS/: contains files used for cracking (charset and dictionary).
  • Sequenziale/: Sequential reference implementation (uses OpenSSL).
  • CUDA_NAIVE/: First GPU implementation (global memory usage).
  • CUDAv1/: Memory optimization (Constant Memory usage for charset and target).
  • CUDAv2/: Kernel optimization (loop unrolling, register optimization for SHA-256).
  • UTILS/: Support functions (file I/O, argument parsing).
  • SHA256_CUDA/: CUDA implementation for SHA256, based on mochimodev's implementation.
  • SHA256_CUDA_OPT/: Optimized CUDA implementation for SHA256 (used by CUDAv2).
  • ESTENSIONE/: contains the implementation of the project extension, i.e., dictionary attack and hash cracking with salt (called from kernel_estensione.cu).
  • kernel_[project_version].cu: file to run the corresponding version. All CUDA[project_version] versions (executed by their respective kernel files) depend on UTILS and SHA256_CUDA files, except for CUDAv2 and extension which use SHA256_CUDA_OPT instead of SHA256_CUDA.
    Note: the dictionary used is a trimmed version to passwords of length 64 of rockyou.txt. Our version is available in the ASSETS folder to be unzipped (due to GitHub limits).

🛠️ Requirements

  • Hardware:
    • NVIDIA GPU (Compute Capability 5.0+)
  • Software:
    • NVIDIA CUDA Toolkit (11.0+)
    • OpenSSL (for CPU implementation)
    • C++ Compiler (MSVC on Windows, GCC/Clang on Linux)

🚀 Compilation

NVIDIA CUDA

Make sure OpenSSL libraries are linked correctly.

nvcc -arch=sm_89 -rdc=true -O3 \
    kernel_naive.cu \
    CUDA_NAIVE/*.cu \
    SHA256_CUDA/*.cu \
    UTILS/*.cu UTILS/*.cpp \
    -o naive_cuda \
    -lssl -lcrypto -lcudadevrt -I.

(change file names and dependencies based on the version to compile)

💻 Usage

The program accepts command line parameters for maximum flexibility:

./brute_force_cuda [<blockSize>] <hash_target> <min_len> <max_len> <file_charset> [<salt> <dictionary-yes/no> <dictionary_file>]

The blockSize must always be passed only in parallel GPU scripts (both CUDA and HIP).
The dictionary (flag and file path) and salt must be passed only in extension scripts.
Note: in the extension version max_len includes the salt length.

Example:

Search for the password of the hash (corresponding to "qwerty") with length 6, using the standard charset:

./brute_force_cuda 256 qwerty 1 6 ASSETS/CharSet.txt az No

📊 Performance Analysis

Tests were conducted on:

  • sequential: Ryzen 9 9900X
  • CUDA: NVIDIA RTX 4060 Laptop and partially Google Colab

Technical Deep Dive: Analysis

The SHA-256 algorithm is heavily Compute-Bound. The v2 implementation heavily uses registers to maintain the hash state and avoid local/global memory latencies. Although the high number of registers (118) limits the number of active warps (low occupancy), the single thread execution speed increases drastically. In this scenario, maximizing IPC (Instructions Per Cycle) proved more effective than maximizing parallelism at the latency level (Occupancy).

Furthermore, the use of smaller block sizes (64/128 threads) led to better performance compared to the classic 256, thanks to better management of the Tail Effect (wave quantization) and lower scheduling overhead.

The extension implementation has essentially the same performance as v2 (since it uses practically the same code), with the addition that for dictionary attack, the time in case of hit is certainly lower than testing all combinations.

👥 Authors

📜 License

This project is distributed under the AGPL license. See the LICENSE file for details.

About

Hash Cracker that uses CUDA parallelism.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors