PCIe transfer compression and OpenCL kernel

While trying to track down the WPA performance loss in version 4.x discussed in this thread https://hashcat.net/forum/thread-7251.html we found the root of the problem and were able to build a solution.

The solution is based on a new technique for copying data from the host system to GPU memory, hence this post.

The problem was introduced when the maximum password length was increased from 64 (in theory; in practice 16-55) to 256 (always). The data type which holds a single password candidate grew from 68 bytes to 260 bytes. While we transfer this data, the GPU is blocked from computing. In other words: the larger the buffer we want to transfer, the longer it takes to copy the data. The longer it takes to copy the data, the less busy the GPU is. The less busy the GPU is, the slower the cracking gets.
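
To make the size difference concrete, here is a rough sketch of what such a padded per-candidate structure could look like. The type and field names are my own illustration, not necessarily hashcat's exact definitions.

Code:
typedef unsigned int u32;

/* v3.x layout: room for 64 bytes of password material -> 16 * 4 + 4 = 68 bytes */
typedef struct
{
  u32 i[16];    // password characters, zero-padded
  u32 pw_len;   // actual password length in bytes
} pw_old_t;

/* v4.x layout: room for 256 bytes of password material -> 64 * 4 + 4 = 260 bytes */
typedef struct
{
  u32 i[64];    // password characters, zero-padded
  u32 pw_len;   // actual password length in bytes
} pw_new_t;

Every candidate in a batch occupies the full 260 bytes on the wire, no matter how short the password actually is.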

The increase of nearly 200 bytes per password candidate doesn't sound like much, but it actually is a lot. To give you some numbers and a feeling for how this affects performance, here's an example: my GTX1080 (cracking WPA) did ~360 kH/s with v3.40 and now does ~300 kH/s, a loss of roughly 17%! It is important to note that this GTX1080 is connected via a PCIe extender and therefore only gets a x1 PCIe link (instead of the x16 link it would get if connected directly to the mainboard).

So what to do? We can't just roll back, because we want to support long passwords, but we also won't tolerate the drop in speed. What about a special compression function? The goal is to transfer less data via PCIe by compressing the "padded data" of each candidate. On average a password candidate only uses 8 of the 260 bytes that are transferred; the rest is padded with zeros, which wastes 252 bytes. It would therefore make sense to get rid of the extra zeros, for example with some kind of "compression" algorithm.
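
Here is a minimal host-side sketch of that idea, assuming a packed stream of 4-byte words plus an index table with one (offset, length) entry per candidate. All names are illustrative and not the actual hashcat implementation.

Code:
#include <stdint.h>
#include <string.h>

typedef uint32_t u32;

// one index entry per candidate: where its data starts in the stream and how long it is
typedef struct
{
  u32 off;   // offset into the stream buffer, in u32 words
  u32 len;   // password length in bytes
} pw_idx_t;

// append one candidate to the compressed stream and record it in the index table;
// returns the new write position (in u32 words) inside the stream
static u32 push_candidate (u32 *stream, pw_idx_t *idx, u32 pos, u32 cand_num,
                           const char *pw, u32 pw_len)
{
  idx[cand_num].off = pos;
  idx[cand_num].len = pw_len;

  const u32 words = (pw_len + 3) / 4;   // round up to the next 4-byte boundary

  memset (stream + pos, 0, words * sizeof (u32));
  memcpy (stream + pos, pw, pw_len);

  return pos + words;
}

For a typical 8-byte candidate this costs 8 bytes of stream data plus an 8-byte index entry instead of 260 padded bytes, roughly a 16x reduction in PCIe traffic.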

I've designed and implemented such a compression function on both the host and the GPU. Instead of writing the candidates into a huge buffer that is pre-aligned for the cracking tasks, the host copies the candidates onto a stream (which is still aligned, but only to 4 bytes), while maintaining an index table in an extra buffer at the same time. This way we have to copy two buffers, but together they are much smaller than the single huge (padded) one.

However, for the cracking kernel to work optimally and as fast as possible, it still needs to access the password candidates in an aligned fashion (with the padded zeros). To achieve this I implemented a new "decompression" kernel which runs on the GPU and is called after both buffers have been copied, but before the actual cracking kernel starts. This kernel reads the index table and the password candidate stream and rebuilds that one huge padded password candidate buffer. Since this is done entirely on the GPU (buffer reads and writes), no data is transferred via PCI-Express, and the decompression is very fast because it parallelizes perfectly.
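
A minimal sketch of what such a decompression kernel could look like in OpenCL C, assuming one work-item per candidate and the illustrative structures from above (this is not hashcat's actual kernel code):

Code:
typedef uint u32;

typedef struct { u32 off; u32 len; } pw_idx_t;         // index entry (see host side)
typedef struct { u32 i[64]; u32 pw_len; } pw_t;        // padded candidate, 260 bytes

__kernel void decompress (__global const u32      *stream,
                          __global const pw_idx_t *idx,
                          __global       pw_t     *pws,
                          const          u32       gid_max)
{
  const u32 gid = (u32) get_global_id (0);

  if (gid >= gid_max) return;

  const u32 off = idx[gid].off;
  const u32 len = idx[gid].len;

  const u32 words = (len + 3) / 4;   // number of packed words for this candidate

  // rebuild the padded candidate: copy the packed words, zero the rest
  for (u32 w = 0; w < 64; w++)
  {
    pws[gid].i[w] = (w < words) ? stream[off + w] : 0;
  }

  pws[gid].pw_len = len;
}

Each work-item rebuilds exactly one candidate, so the decompression scales with the number of candidates and never touches the PCIe bus.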

The results look as follows:

MD5:
  • v4.0.1 -> v4.1.0 (750Ti @ PCIe x16): 5.31 MH/s -> 6.82 MH/s +  28%
  • v4.0.1 -> v4.1.0 (1080  @ PCIe x1):  1.47 MH/s -> 8.45 MH/s + 574%

Wordpress:
  • v4.0.1 -> v4.1.0 (750Ti @ PCIe x16):  805 kH/s ->  839 kH/s +   1%
  • v4.0.1 -> v4.1.0 (1080  @ PCIe x1):  1.21 MH/s -> 3.86 MH/s + 319%

WPA:
  • v4.0.1 -> v4.1.0 (750Ti @ PCIe x16):   60 kH/s ->   61 kH/s +   1%
  • v4.0.1 -> v4.1.0 (1080  @ PCIe x1):   298 kH/s ->  357 kH/s +  17%


As you can see, this change has a very large effect on PCIe x1, but it also brings some gains on PCIe x16.

Of course it works with all hash-modes in combination with dictionary-based attacks, not just the three hash-modes in the example.

- atom
