RX Vega 64 Hot Spot Temperature

Crawdaddy79

Radeon 8500
Rage3D Subscriber
Hello gentlemen and the lady or two,

I have two related questions that I hope can be answered without me having to resort to going to Reddit.

What is the highest hot spot temperature you've seen on your Vega 64? What is the maximum temperature it can get to before your card turns itself off in a max-fan-shutdown fashion?

GPU-Z seems to be the best application to find this, except it doesn't log in real time and instead dumps the data when you tell it to stop logging. That is no good for seeing what the temps were just before a crash.
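One way around the lost-data problem would be a logger that forces every sample to disk as soon as it is taken, so the last rows before a crash survive. A minimal Python sketch of that idea follows. `read_sensors` is a hypothetical stand-in that returns fixed values (nothing here actually talks to GPU-Z or Afterburner), and the column names are made up for illustration.

```python
import csv
import os
import time


def read_sensors():
    """Hypothetical placeholder: swap in a real sensor source."""
    return {"gpu_temp_c": 80.0, "hotspot_c": 106.0, "fan_rpm": 2400}


def log_samples(path, samples, interval_s=1.0, reader=read_sensors):
    """Append one CSV row per sample and force it to disk right away."""
    fields = ["timestamp", "gpu_temp_c", "hotspot_c", "fan_rpm"]
    new_file = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields)
        if new_file:
            writer.writeheader()
        for i in range(samples):
            row = {"timestamp": time.time(), **reader()}
            writer.writerow(row)
            f.flush()             # push Python's buffer to the OS
            os.fsync(f.fileno())  # ask the OS to write it to the disk
            if i < samples - 1:
                time.sleep(interval_s)
```

Calling something like `log_samples("gpu_log.csv", samples=3600)` would record an hour at one sample per second. The `fsync` on every row costs some disk activity, but that is the trade-off for keeping the final seconds before a hard lock-up.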

Under load, I am seeing regular hits of 106C and peaks of 108C while the GPU core temp stays at a comfortable 80C. Before learning about hot spots today, I was leaning towards the max-fan shutdown being a driver glitch, because it will happen even when the GPU is reporting as low as 74C. I now think it may be heat related, down to this dastardly array of hot spots that my MSI AirBoost OC may not properly cool.

What little I find on the Vega 64 hot spot is sourceless posts on Reddit, sometimes referencing other Reddit posts that have no source. The 5700XT apparently can hit 110C and be okay.
 
When I looked at mine when I first got it, it seemed to stay under 100C at all times. But undervolting it to 1.1V really brought temps and power consumption down a lot. It didn't crash before undervolting, but it was slower: about 1580 MHz core under load. Now I get a solid 1700 MHz core.

This is the blower cooler model from Sapphire. So if you are crashing, I'd undervolt, even if only a little bit.

Some have also repasted and/or tightened the screws, as some heatsinks were apparently not making good contact with the GPU.
 
I'd love to see my readings stay below 100C.

As for the crashing, a year ago it happened all the time with The Division 2, and then with Red Dead Redemption 2. Issues seemed to have been resolved with drivers in games, but now I'm folding and it's reared its ugly head again.

I forgot to mention the above numbers are with a -5% power limit and a 930 MHz HBM down-clock; I suspect that at default clock/power settings my hot spot temps reach 110C, and that's where my crash can happen.
 
Yeah you really need to lower your voltage, not just adjust the power limit. Power limit will cap how much the card draws, but it will still be drawing too much at any given frequency because the voltage is set too high.

By lowering voltage you lower the power consumption at a given frequency. Usually with Vegas you can cut the voltage back quite a bit, although presumably there are some bad chips where that doesn't work or they wouldn't have shipped at those voltages (I wouldn't fully rule out straight up incompetence though because Vega seems severely overvolted).
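As a rough illustration of why voltage matters more than the power limit: dynamic power in CMOS logic scales approximately with frequency times voltage squared. The sketch below applies only that rule of thumb. It ignores leakage and everything else on the board, and the 1.2 V starting point is an assumed stock value for the example, not a measurement from this thread.

```python
def relative_dynamic_power(v_new, v_old, f_new=1.0, f_old=1.0):
    """Approximate P_new / P_old using P ~ C * V^2 * f (leakage ignored)."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Assumed example: dropping from 1.2 V to 1.1 V at the same clock.
ratio = relative_dynamic_power(1.1, 1.2)
print(f"Estimated dynamic power: {ratio:.0%} of the original")
```

Under those assumptions the estimate comes out at roughly 84% of the original dynamic power at the same clock, which is headroom the card can spend on holding a higher frequency or running cooler.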
 
Have you undervolted at all, or just lowered your power limit?

Just decreasing the power limit seems to be enough to make it stable, so I haven't really bothered learning how to tweak things for this card.

I did Auto Undervolt religiously but got tired of re-enabling it after a crash. It didn't seem to help much, but I don't know what the Hot Spot readings were at the time either.
 
:lol: @ Mr. Watercooled

I had another Max Fan crash this morning using settings that kept the hotspot at 104C or lower. Just limiting the voltage to 1.1V was not enough.

After a number of tweaks, this is what I ended up with, and it keeps the hotspot below 100C.

20200514_GPU_settings.png


Lowering the speed to this increases my GPU Time Per Frame (folding) by about 5%. Hopefully I'm done with crashes.
 
After a number of tweaks, this is what I ended up with, and it keeps the hotspot below 100C.

This is no longer true. I see the above numbers when folding a Pr 16435 work unit, which is the one giving me the most trouble. Hotspot temps consistently reach 104C with other folding Work Units while the system stays stable. GPU-Z reports a max voltage of 1.0875V, but the large majority of the time it's below 1.07V.
 
Capturing a fold of an 11748 Work Unit today at default settings: it shows 110C max (averages about 104C).

20200519_folding_GPUz.png


^ Stable.

Over Sunday night, it was folding a 16435 WU with downclocked/undervolted settings and it got a Max Fan crash. Capture of the logfile (changed header titles to tighten it up for posting; sorted largest to smallest at hotspot column):

20200517_folding_GPUz.png


^ Not stable.

So this "max fan" crash that I've been chasing appears to be not heat related at all, but a bug. Notice in the GPU-Z capture, it shows my SOC VRM temp reaching over 3000C :eek: Afterburner captures the same thing, but I blamed the program, thinking it was bugged. I don't know of a program other than Afterburner that saves stats in "real time"; GPU-Z only dumps what's in memory when you tell it to. Right now I don't have a way of knowing what's going on at the time of the crash, but I may reinstall Afterburner if this bothers me more.

The 3000C+ reading is not limited to SOC VRM; I've also seen it in GPU VRM, Mem VRM, Power Draw (yes, it read over 3000W), and Memory Temp.

I think somehow my fan RPM reading is bleeding over to other values. I'm wondering if this is a common issue with all Vega 64 or all MSI Airboost OC Vega 64, or just my Vega 64.

My new suspicion is that sometimes these blips happen in rapid succession, causing a software panic of SHUT THIS CARD DOWN NOW with the fan immediately going to 100% and staying there indefinitely and I have to hold the power button to shut it off. It doesn't seem to occur when the card is idle (fans at min 233 RPM don't show a 233C max on other temp readings; could be that I just haven't let it idle enough though).
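A quick way to test the bleed-over theory would be to scan a saved sensor log for physically impossible readings and compare them with the fan speed in the same row. This is a rough sketch only: the column names and sample rows are invented for illustration and will not match a real GPU-Z or Afterburner log header, and the limit of 150 is an arbitrary cutoff.

```python
import csv
import io


def find_glitches(csv_text, columns, limit=150.0, fan_col="fan_rpm"):
    """Return (row_number, column, value, fan_rpm_text) for readings above limit."""
    hits = []
    reader = csv.DictReader(io.StringIO(csv_text))
    for n, row in enumerate(reader, start=1):
        for col in columns:
            try:
                value = float(row[col])
            except (KeyError, TypeError, ValueError):
                continue  # missing or non-numeric cell, skip it
            if value > limit:
                hits.append((n, col, value, row.get(fan_col)))
    return hits


# Made-up sample data, not taken from the captures in this thread.
sample = """hotspot_c,soc_vrm_c,fan_rpm
104,67,3010
103,3012,3012
102,67,3008
"""

for n, col, value, fan in find_glitches(sample, ["hotspot_c", "soc_vrm_c"]):
    print(f"row {n}: {col}={value:.0f} while fan_rpm={fan}")
```

If the glitched values track the fan RPM in the same row, that would support the bleed-over idea; if they don't, the bad readings are probably coming from somewhere else.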
 
Is your HBM memory with resin or not? If you have resin, it would be easier to get better temps than with the 'naked' one. I used to have the naked reference 56. I had to re-seat the cooler 3 times before I got good Hot Spot temps. If you seat it properly, it should be about a 10C difference from core temps; if not, it could vary between 15-20C.

https://linustechtips.com/main/topic/825565-not-all-vega-gpus-are-made-the-same/

I learned about that in the last couple of weeks and I honestly have no idea which version I have, and I don't have the thermal grease or expertise to take my cooler off and put it back on.

1QQh4Od.gif



GPU speed fluctuates as I don't have it locked, but that's what I got.

Can you set GPUz to hold the maximum (high) values, and let it run for 45 minutes or so? I'm really curious if your SOC VRM temp (or some other reading) will glitch out like mine does. 99.999% of the time mine is at 67C - but 0.001% of the time it reads over 3000C and it's easy to miss unless you set the software to hold max values.
 
The "naked" ones are the ones with Hynix memory; the ones with Samsung memory are not naked in most cases. GPU-Z can tell you, so you can find out without taking it apart. Almost all AIB Vega 56 cards use Hynix memory, and for the Vega 64 it seems to be 50/50. I can't give you much advice on your hot spot temperatures because when I had my Vega 64 reference, I converted it to water cooling right after I bought it, and my old man memory can't remember the two weeks I ran it on air 2 years ago. It is now sitting in my closet collecting dust because I don't have another liquid cooled rig to put it in.

However, you can find just about any answer here (it will just take time to read through all the posts):

https://www.overclock.net/forum/67-amd/1634018-official-vega-frontier-rx-vega-owners-thread.html
 
Mine has Samsung HBM - didn't expect to learn that my GPU is likely one with the resin.

I read through the reddit Vega Underclocking/undervolting megathread and learned a bit in that. What I've found from playing with my settings is that underclocking/undervolting does not make this a more stable folding machine. I did learn that most people up their power limit to +50% while undervolting (makes sense if the board is requiring higher amperage because you're limiting the voltage) and I thought that was the solution that I needed. I never would have tried it on my own, but as it turns out, it fixes nothing.

There is no pattern that associates my crashes with higher/lower temperatures, just with 16435 Work Units with folding, and today I had issues with another project number - my PC crashed three times in two hours. The only thing it had in common with 16435 was the checkpointing frequency was high. Usually they're 2.5% or higher, but this was 0.25% - and 16435 is 0.20%, which means it's "taking a breath" every 30 seconds as opposed to every two minutes with other projects (this also means that temps are generally lower for these projects I'm having the most trouble with). My next move is to uninstall FAH from my SSD, then install it on a platter drive and see if that does anything.
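The checkpoint arithmetic can be sketched as follows. The 15,000-second run time is an assumed figure, chosen only so that a 0.20% checkpoint lands every 30 seconds as described; real work units differ in length, which is why a 2.5% project can still checkpoint every two minutes or so.

```python
def checkpoint_interval_s(total_runtime_s, checkpoint_percent):
    """Seconds between checkpoints for a work unit of the given length."""
    return total_runtime_s * checkpoint_percent / 100.0


total = 15_000  # assumed work unit run time in seconds
for pct in (0.20, 0.25, 2.5):
    interval = checkpoint_interval_s(total, pct)
    print(f"{pct}% checkpoints -> every {interval:.1f} s")
```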

If I get time during work hours to read that thread, I will.
 
It could be a motherboard BIOS issue. There were some issues when the Vega 64 first came out, with black screens and crashes that were related to the motherboard BIOS. You may want to check into that for your board. Also, although you have the minimum recommended power supply (750 watt) for a Vega 64, it could be the issue.
 
BIOS is updated, and my UPS reads a 515W draw at max - this includes my monitor, router, speakers, etc.
 
Can you set GPUz to hold the maximum (high) values, and let it run for 45 minutes or so? I'm really curious if your SOC VRM temp (or some other reading) will glitch out like mine does. 99.999% of the time mine is at 67C - but 0.001% of the time it reads over 3000C and it's easy to miss unless you set the software to hold max values.


Mine will glitch, but it takes a day or so of running. It's usually on the water temp reading though, not the core/hotspot readings.
 