Channel: hashcat Forum - All Forums

Three hd7970 = OS hang

Hey Guys,

Recently three of my older cards bricked (1 x hd5970, 1 x hd6990 and 1 x hd7970), so I bought three hd7970s.

The new cards are all GIGABYTE GV-7970C-3GD, which are overclocked to 1000 MHz by the vendor.

The problem is that when I run them in parallel, the OS hangs after 1-2 minutes.

But I think the GPUs are OK, because if I run them solo with -d 1, -d 2 and -d 3 the OS does not hang. The OS only hangs if I run at least two in parallel. It doesn't matter whether they run in a single oclHashcat instance or in multiple instances.

To sort out the problem I tried a lot of different scenarios, but now I am out of ideas :(

First I tried them on two different systems that I have used for some time with other GPUs:

1st:

- Intel i7-4770K
- ASUS Z87 Expert
- Ubuntu 14.04 LTS, 64-bit

2nd:

- Intel i7-4770K
- ASUS Z87 A
- Ubuntu 12.04 LTS, 64-bit

On both systems the behavior is exactly the same, and since I have used these systems with other cards, my feeling is that there's no hardware defect in the boards, CPUs or RAM.

More information about the hardware:

- The original cooler has been removed and replaced with EK water-cooling blocks. They are connected in series with a water-cooling bridge. The coolant flow works fine
- There are no extender cables/risers involved. All cards sit directly on the board
- All cards run headless, none of them is connected to a monitor

Heat: The GPUs run at ~40c when idle and rise to ~55c under load before the OS hangs. There's no specific temperature threshold that leads to the hang; it happens somewhere above 50c.
Power: Each GPU has a dedicated 700W power supply

Things I tried to change:

- Tried Catalyst 14.4 and 14.6 beta on both systems; always ran amdconfig --initial -f --adapter=all afterwards and rebooted
- Updated the mainboards' BIOS to the latest version (1803)
- Updated the GPUs' BIOS to the latest version (F72)
- Manually switched the PCI-E setting in the BIOS from x16 to x1
- Manually disabled ASPM in the BIOS
- Underclocked the cards to stock hd7970 settings (925/1375)
- Attached the original fan to the fan plug on the cards
- Switched the GPU positions from 1 to 2, 2 to 3, 3 to 1, etc.
- Disabled IOMMU on the kernel command line
- Blacklisted the mei and mei_me modules
- Tried with only 2 cards
- Tried both ALU-intensive and memory-intensive algorithms

One thing to note is that when I disable X11 (so that ADL can't work and oclHashcat cannot read temps etc.), the status looks like this:

Quote:
Speed.GPU.#1...: 15891.9 MH/s
Speed.GPU.#2...: 0 H/s
Speed.GPU.#3...: 0 H/s
Speed.GPU.#*...: 15891.9 MH/s

... and when I then continuously press "s", it seems #1 continues to work ...

But when I have X11 enabled and temps are read, it always looks like this:

Quote:
[s]tatus [p]ause [r]esume [b]ypass [q]uit =>

Speed.GPU.#1...: 15873.6 MH/s
Speed.GPU.#2...: 15871.1 MH/s
Speed.GPU.#3...: 15880.4 MH/s
Speed.GPU.#*...: 47625.2 MH/s
Recovered......: 0/1 (0.00%) Digests, 0/1 (0.00%) Salts
Progress.......: 1567903186944/6634204312890625 (0.02%)
Skipped........: 0/1567903186944 (0.00%)
Rejected.......: 0/1567903186944 (0.00%)
HWMon.GPU.#1...: 98% Util, 41c Temp, 29% Fan
HWMon.GPU.#2...: 98% Util, 41c Temp, 29% Fan
HWMon.GPU.#3...: 98% Util, 42c Temp, 31% Fan

[s]tatus [p]ause [r]esume [b]ypass [q]uit =>

ERROR: Temperature limit on GPU 2 reached, aborting...

The system is completely frozen at this point.
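
The mismatch in the status output above (about 41c reported, then a temperature abort) can be checked mechanically. What follows is only a rough sketch: it assumes the status text has been saved to a string, and `plausible_limit_c` is a made-up cutoff for illustration, not oclHashcat's actual abort threshold (which is not stated in this post).

```python
import re

# Matches lines like: "HWMon.GPU.#2...: 98% Util, 41c Temp, 29% Fan"
HWMON_RE = re.compile(
    r"HWMon\.GPU\.#(\d+)\.+:\s*(\d+)% Util,\s*(\d+)c Temp,\s*(\d+)% Fan"
)
# Matches: "ERROR: Temperature limit on GPU 2 reached, aborting..."
ABORT_RE = re.compile(r"Temperature limit on GPU (\d+) reached")


def check_status(text, plausible_limit_c=80):
    """Return (temps, aborted, suspicious).

    temps      -- {gpu_number: last reported temperature in C}
    aborted    -- GPU numbers named in a temperature-limit error
    suspicious -- aborted GPUs whose last reported temperature was below
                  plausible_limit_c (a hypothetical cutoff, see lead-in)
    """
    temps = {int(m.group(1)): int(m.group(3)) for m in HWMON_RE.finditer(text)}
    aborted = [int(m.group(1)) for m in ABORT_RE.finditer(text)]
    suspicious = [
        gpu for gpu in aborted
        if temps.get(gpu, plausible_limit_c) < plausible_limit_c
    ]
    return temps, aborted, suspicious


# Sample copied from the status output quoted in this post.
SAMPLE_STATUS = """\
HWMon.GPU.#1...: 98% Util, 41c Temp, 29% Fan
HWMon.GPU.#2...: 98% Util, 41c Temp, 29% Fan
HWMon.GPU.#3...: 98% Util, 42c Temp, 31% Fan

[s]tatus [p]ause [r]esume [b]ypass [q]uit =>

ERROR: Temperature limit on GPU 2 reached, aborting...
"""

if __name__ == "__main__":
    temps, aborted, suspicious = check_status(SAMPLE_STATUS)
    print("temps:", temps)
    print("aborted:", aborted)
    print("aborted despite low reported temp:", suspicious)
```

If a GPU lands in the last list, the abort probably came from a bad sensor read rather than real heat, but that is a guess and not something the output proves.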

Another interesting thing: according to the lspci output, the cards run at different PCI-E link widths (x8 and x2) and ignore my manual x1 setting from the BIOS:

Quote:
root@et:~# lspci -vv | grep -e "VGA " -e Width
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X] (prog-if 00 [VGA controller])
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
02:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X] (prog-if 00 [VGA controller])
LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCap: Port #1, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X] (prog-if 00 [VGA controller])
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 2.5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 2.5GT/s, Width x2, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-

This is from Ubuntu 14.04.
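
To make the LnkCap/LnkSta comparison less error-prone than reading it by eye, here is a minimal sketch that pairs each LnkCap line with the LnkSta line that follows it and flags links running below their advertised speed or width. It assumes the grep-filtered text has been captured into a string. Because the grep drops the headers of non-VGA devices, the device label on each pair is only "the most recent VGA line", so some pairs may really belong to other functions or ports. The function names are made up for illustration.

```python
import re

# Captures e.g. "LnkCap: Port #0, Speed 8GT/s, Width x16" and
# "LnkSta: Speed 2.5GT/s, Width x8". The optional "(...)" group tolerates
# annotations that some lspci versions print after the speed.
LINK_RE = re.compile(
    r"(LnkCap|LnkSta):.*?Speed ([\d.]+)GT/s(?: \([^)]*\))?, Width x(\d+)"
)


def parse_links(lspci_text):
    """Return [(last_vga_address, (cap_speed, cap_width), (sta_speed, sta_width))]."""
    results = []
    device = None  # address of the most recent VGA header line (approximate owner)
    cap = None     # pending LnkCap waiting for its LnkSta
    for raw in lspci_text.splitlines():
        line = raw.strip()
        if "VGA compatible controller" in line:
            device = line.split()[0]
            continue
        m = LINK_RE.search(line)
        if not m:
            continue
        kind, speed, width = m.group(1), float(m.group(2)), int(m.group(3))
        if kind == "LnkCap":
            cap = (speed, width)
        elif cap is not None:
            results.append((device, cap, (speed, width)))
            cap = None
    return results


def downgraded(links):
    """Keep only links whose status speed or width is below the capability."""
    return [l for l in links if l[2][0] < l[1][0] or l[2][1] < l[1][1]]


# Sample lines copied from the lspci output quoted in this post.
SAMPLE_LSPCI = """\
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tahiti XT [Radeon HD 7970/8970 OEM / R9 280X] (prog-if 00 [VGA controller])
LnkCap: Port #0, Speed 8GT/s, Width x16, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s unlimited, L1 <64us
LnkSta: Speed 2.5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
"""

if __name__ == "__main__":
    for dev, cap, sta in downgraded(parse_links(SAMPLE_LSPCI)):
        print(f"{dev}: capable of {cap[0]}GT/s x{cap[1]}, running {sta[0]}GT/s x{sta[1]}")
```

Running it once at idle and once under load would show whether the negotiated speed or width changes with activity. It only reports what lspci prints and does not explain why a link was negotiated lower.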

At this point I'm stuck. I tried everything I could think of. The only explanation left, from my point of view, is that the ASUS Z87 boards are not capable of running three 7970s, but that sounds too strange...
