RX Vega 64 Hot Spot Temperature

The docs for my GPU say to reinstall the video card drivers after flipping the BIOS switch.

It won't get to POST. :(

To everyone else, I did not flip the switch while it was on. The thing has been crashing frequently, more than before the RMA. I simply waited for it to crash before powering down and flipping the switch. I thought I had laid out my steps clearly, but maybe not.
 
You can't flash the reduced power mode BIOS; it's there as the backup in the dual-BIOS design.

It's possible you have been running in the reduced power mode all along. By switching to the "performance" mode, which is flashable, you have increased power usage, even at boot, since I believe most cards boot at max clock speeds as part of the self-test set in the card's firmware. If that is the case, your motherboard may be the culprit: it could have a power delivery issue, or possibly a motherboard BIOS issue even though you have the latest version, since GPUs can draw up to 75 watts from the motherboard slot.


Have you tested the card in a different machine/motherboard?

In all honesty, even though your UPS shows a max draw in the 500s, it sounds like a power supply issue if it isn't your motherboard. It's also possible you have a faulty PCIe power cable or connector. Either way, I would look for a power delivery issue, not at the card.
 
I agree the randomness of how it does crash seems like a power issue, but I still think it's an issue with the card. Reasons below:

Operation works in Primary (high wattage) mode. According to this chart https://www.tomshardware.com/reviews/amd-radeon-rx-vega-64,5173.html my power draw matches all three presets when folding a "real" WU.

Card when drawing only 110W ON A 16421 WU will still crash.

Power Saver preset keeps it under 165W regardless. Still crashes.

HD 7970 folded for a month at roughly 200W constant.

So it sounds like there is nothing special I had to do prior to flipping the switch (other than powering the PC off). The card is even more dorked up than I thought. Wish I knew about the switch before RMA. MSI documentation is generic AF.
 
To answer your question, I have not tried it in another motherboard. I have no working motherboards to use, and everyone I know who has a PC does not have an appropriate power supply. :(
 
How do you have the card plugged in? Are you using 2 separate PCIe power cables? If not, you need to switch to 2 separate cables (not one cable with dual PCIe connectors).


Also keep in mind that many power draw software utilities are not reliable, as most don't account for power spikes that last only a couple of ms. If you are using GPU-Z to log the card's power draw, it is not reliable: it only reports what the GPU tells it, and that is only the GPU core power draw, excluding the rest of the card such as memory.

https://www.extremetech.com/computing/310217-how-much-power-do-gpus-actually-consume


You also can't rely on the power saving setting, as it isn't a hard limit: the card doesn't instantly throttle to 165 watts, and it is allowed to go over for short periods. It's basically a running average over a set time frame that works out to 165 watts, designed to allow headroom for fluctuations. Most power draw utilities do the same thing: what they show as max draw has the spikes and lows averaged out. Most power supplies are also rated with such spikes in mind (i.e., headroom to handle fluctuation spikes over their rating).

The only real way to get truly accurate power draw readings is with special equipment.
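
To make the running-average point concrete, here is a minimal Python sketch. The numbers are made up for illustration (1 ms samples, a 150 W baseline with a 2 ms spike to 300 W every 20 ms, and a 20 ms averaging window); they are not measurements from any card, and the real firmware's averaging window is an unknown here. It only shows how an averaged reading can sit at 165 W while the instantaneous draw goes far higher.

```python
def rolling_average(samples, window):
    """Trailing moving average; early entries average over fewer samples."""
    averaged = []
    total = 0.0
    for i, value in enumerate(samples):
        total += value
        if i >= window:
            # Drop the sample that just fell out of the window.
            total -= samples[i - window]
        averaged.append(total / min(i + 1, window))
    return averaged


# Hypothetical trace: one sample per millisecond, 200 ms long.
# Baseline 150 W, with a 2 ms spike to 300 W at the start of every 20 ms.
samples = [300.0 if i % 20 < 2 else 150.0 for i in range(200)]

WINDOW_MS = 20  # assumed averaging window, not a documented value
averaged = rolling_average(samples, WINDOW_MS)

peak_instantaneous = max(samples)
# Skip the warm-up entries where the window is not yet full.
peak_averaged = max(averaged[WINDOW_MS - 1:])

print(f"peak instantaneous draw: {peak_instantaneous:.0f} W")
print(f"peak averaged draw:      {peak_averaged:.0f} W")
```

In this toy trace the averaged reading never goes above 165 W even though the draw hits 300 W every 20 ms, which is the kind of spike a software readout would not show.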
 
I use two separate 8-pin cables.

All good points, NWR. Unfortunately, I still think I have too much evidence against my power supply being at fault.

Another reason: I've put it in Turbo mode and core draw is 250W at that point. It is less stable, sure, but it can and will run at 250W for hours without issue. Add CPU folding for another 100W on the power supply and it still runs fine. Until it crashes.

In fact, Saturday morning putting it in Turbo mode kept it from crashing. I thought I found a fix, until it kept dying on me Monday.

It seems that if my power supply had issues supplying power to the GPU (when in Power Saver mode, or at -14% clock, etc.), the crash would be reasonably reproducible by increasing the load by 100W on the one pair of cables.

I think whatever the issue is, it's related to when the card cools down after being under load. In Red Dead Redemption, it only crashes during cut scenes. In The Division 2, it only crashes when going indoors from outdoors (never the reverse). When folding, it seems to crash when writing a checkpoint (which is why I had such an issue with 16435 WUs that wrote checkpoints every 0.2%, about every 45s). All of these are moments when the load is released and the GPU temp drops.

Then there's my UPS readout. With the 7970 I got about 380W w/idle CPU. Vega has definitely died on me with less than that. Mine is a really good power supply that's spec'd for 748.8W on the 12V rail, and EVGA is a brand with a good reputation. I just don't see it being the issue considering the patterns I've been able to put together.
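
For a rough sanity check on the steady-state numbers above, the arithmetic can be sketched in Python. The 748.8W rail rating and the 380W UPS reading are from this post; the 90% PSU efficiency is an assumed figure, and treating the whole load as 12V load is a deliberate overestimate. This only covers average load and says nothing about how the supply handles sudden changes.

```python
RAIL_12V_WATTS = 748.8      # 12V rail rating quoted in the post
ASSUMED_EFFICIENCY = 0.90   # assumption: typical ballpark, not a measured value


def rail_current_amps(watts, volts=12.0):
    """Current on the rail for a given power at the given voltage."""
    return watts / volts


def dc_load_from_wall(wall_watts, efficiency=ASSUMED_EFFICIENCY):
    """Estimate DC-side load from a wall/UPS reading."""
    return wall_watts * efficiency


def headroom_fraction(wall_watts):
    """Fraction of the 12V rail left unused, assuming all load is on 12V."""
    return 1.0 - dc_load_from_wall(wall_watts) / RAIL_12V_WATTS


ups_reading_watts = 380.0   # UPS readout with the 7970, from the post
print(f"rail rating: {rail_current_amps(RAIL_12V_WATTS):.1f} A at 12 V")
print(f"estimated DC load: {dc_load_from_wall(ups_reading_watts):.0f} W")
print(f"estimated headroom: {headroom_fraction(ups_reading_watts):.0%}")
```

Under those assumptions a 380W reading at the UPS leaves roughly half of the 12V rail unused.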
 
So it appears it has nothing to do with power draw, but power drop.

When the load on the card is released, the power draw drops too, even if only for a brief moment. All it takes is a failing power regulator in the power supply to cause such issues. I also have an EVGA power supply. Yes, they are good power supplies, but even the best-made products have failures.

I know that your 7970 doesn't have the issue, but you are talking about 2 different architectures that handle power differently, partly due to the architecture and partly due to the speed difference: the Vega 64 is twice as fast as the 7970, which can affect how quickly it stops drawing power, how quickly it reduces that draw, etc. If the power regulator or some other part (a bad capacitor, etc.) in the power supply is failing, all it takes is a minor difference between the two cards to cause issues connected not to max power draw but to the sudden drop in power draw. Basically, it's possible the power supply is not able to handle the transient spikes or drops quickly enough, resulting in a crash.

I guess all I am trying to say is, don't assume it's not the power supply. I have had power supplies test good, and in the end the power supply turned out to be the problem.

It could also be an issue with the card's P-states when it tries to adjust for the drop in load. But again, I wouldn't overlook the power supply.

I would also remove the undervolt and any other manual adjustments you have made until you can get the card to run stable without any lockups (run it bare-bones stock). Undervolting alone can cause lockups.
 
I had a pretty hefty response but lost it all in another crash.

Long story short: getting a new PSU works better for my conundrum (I want Big Navi and I want to build my kids a PC because they're destroying my laptop that I need for travel when I eventually travel again).

If I have a bad PSU, I can RMA it and use the one that comes back for the kids' PC. I lose nothing by buying a 2nd PSU even if Vega still crashes.
 
After 36 hours with a new 850W EVGA G+ PSU, my crashing issues persist.

I had promising results initially. 18 hours straight, the machine folded and did not crumble. RDR2 did not crash.

I typed up a huge post about eating crow, invoking LordHawkwind (from the Vega Owners thread), and a few other things. But I saved it offline because I know how finicky this **** can be. I was so angry and happy at the same time - I didn't know that the two extremes could be experienced simultaneously.

But then there was a crash. Chalked it up to the F@H core. Pretended it didn't happen.

But then there was another crash. Maybe driver issue. Not hardware for sure. These things happen.

But then there was a third crash. Max fan, this time. Just like the old days. Nothing is different.
 
Well, gentlemen (and the odd lady) it's been a solid week of folding with no crashing.

20200816_reliability_history.png


I think it's been long enough that I can announce to you all that I've found a fix.

The fix?

I bought a Sapphire Nitro+ 5700XT :o

This means that my Vega 64 may make a nice paperweight, but now I can RMA it without obsessing over when it's going to get back and whether it's going to work. I found a way to ship it for $16, and I'll ship it a third time if it comes back bunked. A working Vega 64 will be overkill for my kids' PC and, when not in use, a significant folding machine.

Best to everyone. Will report back if MSI actually fixes my card or when I give up on it.
 
With a whimper, the saga ends.

20200831_rma_end.png


Refund price blanked out because of confidentiality notice at bottom of email. The number they presented is fair.
 
Nice deposit on the next one? Timely.

More like a rebate. I've been rocking a 5700XT for a few weeks.

After much reflection, it's the most perfect outcome. I still want Big Navi, but it made even less sense to get it when I have V64 + 5700XT. With V64 out of the equation and a reasonable sum in my pocket, it works out pretty well assuming Big Navi will be semi-reasonable in price and actually available for purchase. I'm picturing March 2021 timeframe. Give the integrators time to figure out the best ways to cool it like Sapphire's done with this Nitro+.
 
Well there you go! You tried every possible cause short of the card being defective, and in the end, it was. Enjoy your new card!
 