Hi, I am relatively new to hashcat, so forgive any foolishness.
I was looking through the code for ways to speed up recovering WPA2 passwords.
I found some, but is it worth it?
Besides that, I see a difference in speedups between NVIDIA/OpenCL (original compared to optimized, both with OpenCL) and NVIDIA/CUDA (original compared to optimized, both with CUDA).
I cannot explain it; I expected roughly the same speedups.
Result (showing only the relevant speedups from the short benchmark):
WPA-EAPOL-PBKDF2 (2500):
* without CUDA Toolkit installed: 4.14%
* with CUDA Toolkit installed: 2.11%
DPAPI masterkey file v1 (15300):
* without CUDA Toolkit installed: 5.01%
* with CUDA Toolkit installed: 1.09%
- Do these results justify a pull request?
- Why are the achieved speedups lower for CUDA than for OpenCL?
More details below for interested readers.
Speedup stats for other hashmodes in the short benchmark:
Without CUDA Toolkit installed:
- Worst: -0.28%
- Best: 0.77%
- Avg: 0.10%
- Stdev: 0.25%
With CUDA Toolkit installed:
- Worst: -0.58%
- Best: 0.49%
- Avg: -0.16%
- Stdev: 0.24%
Optimization:
SHA1_transform(), SHA1_transform_vector() in inc_hash_sha1.cl
Based on: Exploiting an HMAC-SHA-1 optimization to speedup PBKDF2
Trade-off:
more memory (80 x u32_t instead of 16 x u32_t), less execution time (the number of zero-optimizable XORs increases by a) not destructively updating w0_t..wf_t and b) maximizing the right-hand-side usage of w0_t..wf_t)
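To make the trade-off concrete, here is a minimal sketch of the two ways to expand the SHA-1 message schedule. This is Python purely for illustration; the real code is OpenCL C in inc_hash_sha1.cl, and the function names here are mine, not hashcat's:

```python
def rotl32(x, n):
    """32-bit left rotate."""
    return ((x << n) | (x >> (32 - n))) & 0xFFFFFFFF

def schedule_destructive(block):
    """Classic scheme: 16 words of state, overwritten in place each round."""
    w = list(block)                          # 16 x u32 of register pressure
    out = []
    for t in range(80):
        if t >= 16:
            w[t % 16] = rotl32(w[(t - 3) % 16] ^ w[(t - 8) % 16]
                               ^ w[(t - 14) % 16] ^ w[t % 16], 1)
        out.append(w[t % 16])
    return out

def schedule_flat(block):
    """Non-destructive scheme: 80 words, each assigned exactly once.
    When some inputs are known-zero constants, the compiler can fold the
    corresponding XOR terms away (a XOR 0 = a)."""
    w = list(block) + [0] * 64               # 80 x u32 of register pressure
    for t in range(16, 80):
        w[t] = rotl32(w[t - 3] ^ w[t - 8] ^ w[t - 14] ^ w[t - 16], 1)
    return w
```

Both produce the same 80-word schedule; the flat form just trades 80 instead of 16 words of storage for single-assignment words that constant propagation can see through.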
Implemented:
basically equation 14 of the article, which still leaves a general function that compilers can optimize further with respect to zero-based operations (a XOR 0 = a, a + 0 = a). The remainder of the article presents some more execution optimizations and a major memory optimization, but these already assume the zero-based operation optimizations, making the resulting function less general. I still need to give that part of the article a second look.
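As a rough illustration of why the zero-based operations matter for PBKDF2-HMAC-SHA1: the second block of each inner hash has a fixed shape (20 digest bytes, the 0x80 padding byte, zeros, the 64-bit length), so under that word layout w6..w14 are compile-time zeros. A small count of how many schedule XOR operands are then known-zero (the layout and names are my assumption for illustration, not taken from hashcat's code):

```python
def count_removable_xors():
    # Assumed second-block word layout: w0..w4 hold the previous 20-byte
    # digest, w5 = 0x80000000 (padding), w6..w14 = 0, w15 = length in bits.
    # zero[i] is True when word i is known to be zero at compile time.
    zero = [False] * 80
    for i in range(6, 15):
        zero[i] = True
    removable = 0
    for t in range(16, 80):
        ops = [zero[t - 3], zero[t - 8], zero[t - 14], zero[t - 16]]
        # each known-zero operand lets the compiler drop one XOR (a XOR 0 = a)
        removable += sum(ops)
        zero[t] = all(ops)   # rotl32(0, 1) == 0, so an all-zero input stays zero
    return removable
```

On this toy count a couple dozen XOR operands in the schedule expansion come out known-zero, which is exactly the folding the non-destructive 80-word form exposes to the compiler.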
Measurement method:
- i7-7700HQ, GTX 1050 (2 GB); low (laptop) specs, I know, but my better machine was busy.
- Based on v5.1.0-1484-gbfd95d42, Ubuntu 18.04, see below for CUDA/OpenCL versions
- Using only GPU acceleration.
- Benchmark command:
$ for i in {0..10} ; do hashcat -b --machine-readable ; done > ../result.txt
- From the 11 results discard the first one (warmup), average the remaining 10 per mode/run.
- Do this with:
- original code
- optimized code
- Do the above on a system (is this useful?):
- without CUDA Toolkit installed
- with CUDA Toolkit installed
- Compute speedup as: percentage(1 - (average from original / average from optimized))
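The averaging and speedup steps above can be sketched as follows (Python; the run lists in the example are hypothetical numbers, not my measurements):

```python
def average_without_warmup(runs):
    """Discard the first (warmup) result and average the remaining runs."""
    return sum(runs[1:]) / len(runs[1:])

def speedup_percent(org_runs, opt_runs):
    """percentage(1 - (average from original / average from optimized));
    positive means the optimized code produced a higher H/s average."""
    return (1.0 - average_without_warmup(org_runs)
                / average_without_warmup(opt_runs)) * 100.0

# e.g. speedup_percent([50, 100, 100], [50, 125, 125]) gives 20.0
```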
New bee (bzzz) questions:
The speedups with the CUDA Toolkit installed do not match those without it.
I assumed that the same .cl code is used both with and without the CUDA Toolkit; is this correct?
Is this explainable? What am I missing here?
---- CUDA / OpenCL info ----
hashcat --backendinfo (with CUDA Toolkit installed):
hashcat (v5.1.0-1484-gbfd95d42) starting...
CUDA Info:
==========
CUDA.Version.: 10.2
Backend Device ID #1 (Alias: #2)
Name...........: GeForce GTX 1050
Processor(s)...: 5
Clock..........: 1493
Memory.Total...: 2000 MB
Memory.Free....: 1943 MB
OpenCL Info:
============
OpenCL Platform ID #1
Vendor..: NVIDIA Corporation
Name....: NVIDIA CUDA
Version.: OpenCL 1.2 CUDA 10.2.95
Backend Device ID #2 (Alias: #1)
Type...........: GPU
Vendor.ID......: 32
Vendor.........: NVIDIA Corporation
Name...........: GeForce GTX 1050
Version........: OpenCL 1.2 CUDA
Processor(s)...: 5
Clock..........: 1493
Memory.Total...: 2000 MB (limited to 500 MB allocatable in one block)
Memory.Free....: 1920 MB
OpenCL.Version.: OpenCL C 1.2
Driver.Version.: 440.33.01