For those who aren't following hashcat on Twitter: I'm experimenting with bitslice DES/LM (-m 3000) on oclHashcat v1.38. So far everything is working fine, but while working on DEScrypt (-m 1500) something really strange happened.
I have a test kernel which I use for my experiments before fully integrating into oclHashcat; it helps me identify performance bottlenecks at an early stage. I finalized it and started porting. Ported to AMD, everything is fine. We're at 470 MH/s on a single stock-clocked 290x (yay, world's fastest!). This is an extreme improvement that I'm a bit proud of, because so far we "only" got 170 MH/s on this algorithm. Still, I don't know yet how to distribute this kernel, because I'm forced to hardcode the salt (and therefore generate 4096 kernels for each architecture). Maybe distributing it as source is the only way to do it. Anyway, I also need to thank Sc00bz for explaining how those E-boxes are to be used.
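To illustrate why hardcoding the salt pays off (and why it forces 4096 kernels): in descrypt, each of the 12 salt bits conditionally swaps a pair of E-box outputs, which in a bitsliced kernel becomes a branchless masked swap of two lanes. A minimal host-side sketch of that swap, assuming the usual XOR-swap formulation (this is my illustration, not the actual oclHashcat code):

```c
#include <stdint.h>

/* Branchless swap of two bitslice lanes, controlled by one salt bit.
   mask is all-ones when the salt bit is set, all-zeros otherwise. */
static void salt_swap(uint32_t *a, uint32_t *b, uint32_t mask)
{
    uint32_t t = (*a ^ *b) & mask;
    *a ^= t;
    *b ^= t;
}
```

With the salt hardcoded, each mask is a compile-time constant, so the compiler either deletes the swap or resolves it for free; with a runtime salt, all 12 masked swaps stay in the inner loop.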
Then I started porting to CUDA. And there's the problem: speeds are slower than expected. I guess you all heard the news about 950 MH/s with descrypt on a 980Ti, etc. I was skeptical about my first implementation, so I rewrote the entire thing in CUDA just to make sure I didn't have a bug somewhere. But no, it turned out there is none. The culprit is nvcc! So what I have here is a kernel whose body is 1:1 the same code on both OpenCL and CUDA (yes, NVidia has an OpenCL runtime too). When I tested it on NVidia's OpenCL, the speed was much better than on their own CUDA?!?! WTH is going on...
To give you some numbers: we're at 73 MH/s on CUDA and 110 MH/s on OpenCL, measured on a 750Ti. OpenCL speed on a 980Ti is around 350 MH/s. What I'm trying to say is that there's something wrong with the nvcc compiler. To prove it I had to do some tricks, since it's not possible to compile OpenCL code with nvcc, but it is possible to dump an OpenCL kernel from NVidia's OpenCL runtime! So I compiled the OpenCL kernel, dumped it, and because it's 1:1 the same code as for CUDA (including the parameters), I was able to load the pure .ptx kernel from cudaHashcat. The resulting speed is about 350 MH/s on CUDA, and hashes are cracking.
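For anyone who wants to reproduce the dump trick, here's a rough sketch. NVidia's OpenCL runtime hands back PTX text when you query the program "binary" through the standard clGetProgramInfo call, and that text can then be fed to the CUDA driver API (cuModuleLoadData / cuModuleGetFunction). This is my own minimal illustration with error checking omitted, not the exact code I used:

```c
#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

/* Dump the "binary" of an already-built cl_program to a file.
   On NVidia's OpenCL runtime this binary is plain PTX text. */
void dump_ptx(cl_program program, const char *path)
{
    size_t size = 0;
    clGetProgramInfo(program, CL_PROGRAM_BINARY_SIZES,
                     sizeof(size), &size, NULL);

    unsigned char *ptx = malloc(size);
    clGetProgramInfo(program, CL_PROGRAM_BINARIES,
                     sizeof(ptx), &ptx, NULL);

    FILE *f = fopen(path, "wb");
    fwrite(ptx, 1, size, f);
    fclose(f);
    free(ptx);

    /* On the CUDA side, something like:
       cuModuleLoadData(&module, ptx);
       cuModuleGetFunction(&func, module, "kernel_name"); */
}
```

Since the kernel body and parameters are 1:1 identical on both runtimes, the dumped PTX drops straight into cudaHashcat's module loader.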
The problem is NVidia's OpenCL runtime: there's no way to tell the compiler to generate code for a specific GPU architecture. But due to our binary kernel distribution we really need that feature!
One last thing, since I know you're going to ask: yes, I'm using lop3 for the sboxes. Reported speeds from other projects doing 950 MH/s on descrypt or pure DES with lop3 are not reproducible. Not even with the pure sboxes inside a minimalistic kernel on a standalone platform. Feel free to try it yourself. What you really get is 470 MH/s on a 290x and 350 MH/s on a 980Ti, and just that is some real improvement.
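For those unfamiliar with lop3: Maxwell's LOP3.LUT instruction computes any 3-input boolean function in a single instruction, selected by an 8-bit truth-table immediate, which is exactly what the DES sbox gate networks want. A plain-C model of its semantics (my own illustration of what the instruction computes, not hashcat code):

```c
#include <stdint.h>

/* Software model of NVIDIA's LOP3.LUT: at each bit position, the three
   input bits (a,b,c) form a 3-bit index into the 8-bit truth table. */
uint32_t lop3(uint32_t a, uint32_t b, uint32_t c, uint8_t lut)
{
    uint32_t r = 0;
    for (int i = 0; i < 32; i++) {
        unsigned idx = (((a >> i) & 1u) << 2)
                     | (((b >> i) & 1u) << 1)
                     |  ((c >> i) & 1u);
        r |= (uint32_t)((lut >> idx) & 1u) << i;
    }
    return r;
}
```

For example, immediate 0x96 gives a three-way XOR and 0xE8 gives a majority function. In the actual kernel the instruction is emitted via inline PTX, along the lines of asm("lop3.b32 %0, %1, %2, %3, 0x96;" : "=r"(r) : "r"(a), "r"(b), "r"(c));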