RX Vega 64 Hot Spot Temperature

Here's an example of a small washer mod that got junction temp a long way down:

https://www.reddit.com/r/Amd/comments/go0kny/public_service_announcement_for_the_owners_of/

It's basically the same as the "tighten the screws on the GPU heatsink a bit" mod...

Hi everyone, I'm writing this post for the few owners of this card: the mounting system for the heatsink is trash.

TLDR:

Before: 73C edge, 103C junction

3 cents and 5 minutes after: 73C edge, 83C junction



I was playing with a few stress tests and I noticed that I always had a huge difference between edge temperature and junction temperature. Edge was at around 75C and junction was at 105C. A 30C delta seemed way too much, and it severely limited my overclock. I even reapplied the thermal paste twice using Noctua NT-H1, but nothing changed. Then user u/f-ben (thanks again!) suggested not using the blob method, so I spread the paste on the die to ensure even coverage. This helped a little with temps, now at 73C and 103C, but the delta was still 30C.

The system that makes the heatsink apply pressure on the die consists of four spring-loaded screws, but the screw threads bottom out well before the springs can apply proper mounting pressure. I started to wonder whether the washer mod used for the reference design could work on this card. I bought a pack of 100 3mm washers, about 0.7mm thick, for 70 cents. I put one in each of the four recesses on the backplate and tightened the screws in the usual cross pattern, without even reapplying the thermal paste. The results are way better than I expected (I didn't think this could do anything significant; I thought I just had bad luck with the silicon lottery).

Edge temperature has not changed and sits at 73C, but the junction is now a massive 20C lower at 83C. Of course, every test was done with the exact same settings and a fixed fan speed.
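
To make the before/after comparison explicit, here is a minimal Python sketch of the edge-to-junction delta using only the readings quoted in the post above. The `delta` helper and the dictionary layout are just for illustration; they don't come from any monitoring tool.

```python
# Readings quoted in the post above, in degrees C.
before = {"edge": 73, "junction": 103}
after = {"edge": 73, "junction": 83}


def delta(reading):
    """Junction-minus-edge gap for one reading."""
    return reading["junction"] - reading["edge"]


# Gap before the washers, gap after, and how much the gap shrank.
improvement = delta(before) - delta(after)
print(delta(before), delta(after), improvement)
```

Since the edge temperature did not move, the whole improvement shows up as a smaller delta, which fits the poster's explanation that the washers improved mounting pressure on the die.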
 
Unfortunately I don't have the thermal grease or expertise to take my card apart and put it back together. It is interesting that a design flaw like that could exist, and I wouldn't put it past my card to have something similar to that 5700XT.

The pattern behind my crashes doesn't follow times of high heat. In fact, stability seems to increase when I decrease air flow and increase the temps. I haven't had a crash during a game in a very long time - probably since last year. It's only when folding.

@ Flyordie: Thanks for looking at it. Mine glitches 1 - 20 times per ten minute interval. I also found that it glitches when not under load as well, but haven't had a max fan crash while idling.

I haven't put the folding work on my platter drive yet. It's been pretty stable today. Not a single 16435 WU but did get a BSOD on a 13851 (haven't seen that before). It recovered and finished.
 
I got onto a chat with MSI support. I told them my issue.

Video card crashes with fans kicking instantly to 100%, seemingly regardless of temperature but always when under load.

Their response?

You need to RMA it. No "Have you tried X, Y, Z?" No "Is it caked with dust?" No "What type of PSU do you have?" No questions at all. Here's the RMA link. Your card is bad.

My worry now is that if I RMA it, they won't be able to replicate my issue and will send it back to me as is, unnecessarily costing me shipping $$$ plus time without playing s00per 1nt3ns3 gamez. But if I can get a new card out of it that doesn't crash HOW WONDERFUL that would be awesome for my sanity.
 
Honestly, I feel like that's the way support has been working lately. No real attempt to troubleshoot. Just push you towards RMA.

I always spend like an hour writing up a very descriptive message about the issue I'm having and try to ask questions about what could be the problem and how I could fix it. Then two days later, they have a basic message about trying to RMA it, and don't even bother to address anything I've said.

And the RMA process is frustrating because you're usually out of the shipping cost. The problem isn't your fault. It's like a $10-$20 extra fee on top of your purchase. And, at least for me, I usually look for a deal and specifically bought the product at a price I wanted. Not $20 more.
 
I was surprised as well. I think because people have to pay for shipping, most won't even bother RMAing a sub-$200 card. Saves the company money in repair technicians, back-to-customer shipping costs (though large businesses get like 90% discounts - when I worked for L3 I used it all the time), and time wasted with the help desk by just telling the customer to RMA. It cost me $30 to ship this one for 5 day ground.

They say a 15 - 35 day turnaround for video cards, but I see threads in Reddit that show 1 - 3 day turnaround is quite common. Some people report free upgrades because the parts are not in stock to repair the current card.

Looking forward to my 5700XT Evoke next week. :bleh:
 
I hope you guys don't mind me turning this thread in this mostly dead section into a "blog" of sorts on my experience with RMAing with MSI.

I requested an RMA - one of the options as an issue was "BSOD". While I had other issues with the card, I often did get BSOD so I picked that option. Submitted the RMA request and waited my 48 hours for feedback. Got nothing.

Waited an additional 24 hours, and still nothing.

Submitted a 2nd RMA request but picked "Other" and described my GPU crashing while kicking the fans on to 100%. Within 10 minutes, got an RMA number - it was enough 'automation' that I thought I must have done something wrong on my first try.

I removed the card and put my 8yr old 7970 in its place. Used it for a day to make sure it was good. Sent the Vega 64 out 32 hours after getting the RMA number.

Eight days from the initial request, I get an approved RMA number for it. I ignore the new RMA#, and will continue to do so.

MSI says they have received it.

rma_has_been_arrived2.png


MSI says they are working on it.

rma_has_been_arrived3.png


Please fix this thing guys.
 
rma_has_been_shipped.png


Seems like good news, except no tracking information:

rma_has_been_shipped_no_track.png

(RMA number and ship time edited because it seems like the right thing to do in this day of bots)

Also no feedback. No emails. Nothing stating they found something or they did not find something. So far, only about 50% as bad as the experience could possibly be. If they actually fixed the card, everything is roses. If the card is not fixed, I am boycotting MSI.

Late edit: have tracking info now. Should arrive Tues.
 
7970 basically folded the entire time the Vega was gone. Not a single crash.

Vega arrived this morning.

Packed neatly in a foam cut out box. They then put that box in a larger box with no packaging material so it could continuously slide around on its voyage across the country.

Did all necessary steps to uninstall drivers, put Vega back in, reinstalled drivers. Ran 3DMark. Not as high scoring as previous runs but new drivers whatever.

Left all settings at default.

Started folding. Two "fan at 100%" crashes in 5 hours.

**** this card. **** MSI.

Very late edit: I edited two days ago to take the above statement back. It was premature. New PSU did not fix my issue.
 
What a sham(e)

:(

Fargin MSI indeed >:E
 
This is the 2nd GPU I've had with their brand on it and the 2nd GPU that has been jacked up in my lifetime of owning GPUs. Previous was MSI X800 Pro with artifacting straight out of the box.

I have an MSI laptop. It works great. I have had three MSI motherboards. They are great as well. No more MSI GPUs for me.
 
This really sucks. Do you have any other recourse to get the card fixed?
 
I've read a few threads/reddit posts that say after you've RMA'd the same card three times they replace it, or you get a credit or refund for the purchase price. For me that would be $280 because I redeemed the three games from the promotion at the time, effectively subtracting $150 from my $430 purchase price. Considering 3 shipping purchases at $30, I come out with $190 and ~three months of using my 7970 instead (which is on its last leg).

Other than that, my only recourse is to get screwed.

Yesterday I clicked the "Request Service" button. In it I explained the card returned from RMA and had the same issue. Explicitly said "I am hoping for another solution other than RMAing the card again.". Their response? An RMA ticket. Pretty sure I'm being trolled.
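
As a quick check of the refund arithmetic above, here is a minimal Python sketch. All figures are the ones quoted in the post; the constant names are just illustrative.

```python
PURCHASE_PRICE = 430    # price paid for the card
GAMES_VALUE = 150       # value of the three redeemed promo games
SHIPPING_PER_RMA = 30   # cost of each outbound shipment
RMA_COUNT = 3           # RMAs needed before a replacement or refund

# Expected refund after the promo games are subtracted.
refund = PURCHASE_PRICE - GAMES_VALUE

# What is left once three rounds of shipping are paid out of pocket.
net = refund - SHIPPING_PER_RMA * RMA_COUNT

print(refund, net)
```

This reproduces the $280 refund and the $190 net figure, before counting the roughly three months without the card.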
 
How much power is it drawing/how hot is it getting when folding? It sucks that there seems to be something wrong with the card, but maybe it's worth trying to get to the bottom of what the problem is.

You could install MSI Afterburner and try bumping the core and memory clock down a little, and/or reduce the power limit (do them one at a time), and see if you can figure out what the issue is. That may help get it resolved, or at least provide a workaround.

Anyway, this sort of thing really sucks. I had a Gigabyte X370 motherboard that suffered from the "soft brick" bug (basically sometimes when powered off the board just wouldn't boot as if it were dead, but pulling the battery fixed it). It was so bad that it was happening multiple times every week, so I tried to RMA it but it wasn't fixed. I eventually just gave the board away to a friend as there wasn't any way to get it fixed properly, and I couldn't sell it to anyone in good conscience with that problem. You definitely have my sympathies dealing with a troublesome RMA situation. :(
 
I've changed numerous settings, and different WUs cause different power load. For instance, the one that I'm crunching now on Power Saver preset was resumed from a crash overnight. It's drawing a meager 115W at peak.

I've fiddled with core voltages, clock settings, and aggressive fan ramp-up. None of it seems to matter.

The one time (before RMA) that it ran for days without a crash was when I used the Turbo Preset (Vega has Power Saver, Balanced, and Turbo Presets in addition to the Auto/Manual tabs) (screenshot), at 82C average GPU temp and average 240W power draw. But then randomly went back to crashing every few hours.

I ran/folded with my HD 7970 at (guesstimate) 180W continuously for over a month without a single crash. It was awesome in that respect.

I'm tempted to pick up a $280 5600XT at a ~15% performance loss and be done with it.
 
Likewise, sorry to hear about your motherboard situation. So far that's a score of 0/2 for anecdotal computer component RMA experiences. My previous motherboard was a Gigabyte UD3R X58. Memory controller failed, but it was 13 years old.

I don't know if I could stomach RMAing a motherboard on my primary PC.
 
The Vega 64 outside of the stock settings, and especially when undervolting, manually overclocking, or underclocking, has/had an issue with clock spikes that would hard lock the machine, because of how it controls clock speeds based on temperature/voltage/heat and headroom in combination. Mine was a reference air-cooled card converted to water, and when I overclocked it I had this issue constantly, but only in a couple of particular games. It took me a lot of time to figure out what was causing it and find settings that wouldn't spike. For example, I could not set my max core clock higher than 1680, otherwise it would spike to 1800+ and hard lock (AMD liquid-cooled cards come set at 1750 by default, air-cooled at 1560). Since my Vega was not stable over 1720, any spikes above that would cause issues, and undervolting made it worse. The reason is how AMD's clock speed algorithm works in conjunction with the temperature/voltage/heat, as I mentioned above.

Also, I know you were originally concerned about the hot spot temp. IIRC, the max hot spot temp for the air-cooled Vega 64 is 115C before it throttles. There is this little voice in the back of my head that keeps trying to tell me it's 130C, but I like to ignore that voice.. lol (it could very well be 130C, I don't fully remember)



Anyhow, it could very well be your clock speeds spiking without you realizing it.


One game that seemed to draw this issue out, believe it or not, was 7 Days to Die. I could play nearly every other game for hours (10+ hours of Battlefield 1 / Battlefield 5) and never have an issue. But with 7 Days to Die, due to the GPU clock spike (caused by my fiddling with core clocks, voltages, undervolting, etc.), it would hard lock either 5 minutes into the game or 2 hours into the game. Many times it was when I paused the game to grab something to eat or drink, and I would come back to a locked-up machine. I had to start logging GPU clock speeds with MSI Afterburner to notice the clock spikes, and then do a lot of reading to try to understand why. I still don't fully understand how AMD's algorithm works, which is why I won't even attempt to explain it to you. Power limits play a role in this as well.
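
For anyone who wants to check their own clock log for this kind of spike, here is a minimal Python sketch. It assumes the logged core clocks are already available as a list of MHz values; the `find_clock_spikes` helper and the sample values are hypothetical and not taken from a real Afterburner log. The 1720 MHz limit is the stability figure quoted above.

```python
def find_clock_spikes(clocks_mhz, limit_mhz):
    """Return (sample_index, clock) pairs for every sample above limit_mhz."""
    return [(i, c) for i, c in enumerate(clocks_mhz) if c > limit_mhz]


# Hypothetical samples for illustration only, not from a real log.
samples = [1652, 1671, 1680, 1812, 1668, 1745, 1660]

# Highest clock the card in the post above was stable at.
STABLE_LIMIT_MHZ = 1720

spikes = find_clock_spikes(samples, STABLE_LIMIT_MHZ)
print(spikes)
```

Any non-empty result means the card briefly ran above the clock it is known to be stable at, even though the configured maximum was lower.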
 
This card is supposed to throttle the clock down when it gets to 75C, and for the most part, it's successful. Right now it's at 1530 MHz and drawing 170W (at 75C).

Here's an old log capture from GPUz when playing a $2 shooter on Steam.
Code:
        Date        	 GPU Clock [MHz] 	 Memory Clock [MHz] 	 GPU Temperature [°C] 	 GPU Temperature (Hot Spot) [°C] 
5/15/2020 23:40	1613	920	76	93
5/15/2020 23:40	1598	919	76	93
5/15/2020 23:40	1548	924	76	91
5/15/2020 23:40	1526	917	76	93
5/15/2020 23:40	1502	920	76	93
5/15/2020 23:40	1599	919	77	93
5/15/2020 23:40	1599	919	77	93
5/15/2020 23:40	1557	833	76	90
5/15/2020 23:40	1557	833	76	85
5/15/2020 23:40	1609	741	76	90
5/15/2020 23:40	1609	741	76	90
5/15/2020 23:40	1516	819	76	88
5/15/2020 23:40	1516	819	76	90
5/15/2020 23:40	1613	882	76	87
5/15/2020 23:40	1412	870	76	97
5/15/2020 23:40	1459	855	77	99
5/15/2020 23:40	1431	871	78	101
5/15/2020 23:40	1431	871	78	101
5/15/2020 23:40	1514	842	78	102
5/15/2020 23:40	26	836	78	96
5/15/2020 23:40	26	924	78	96
5/15/2020 23:40	1563	926	77	95
5/15/2020 23:40	1596	919	78	94
5/15/2020 23:40	1545	908	78	95
5/15/2020 23:40	1617	910	78	95
5/15/2020 23:40	1582	923	78	94
5/15/2020 23:40	1617	921	78	94
5/15/2020 23:40	1595	903	77	95
5/15/2020 23:41	1600	915	77	94
5/15/2020 23:41	1561	920	78	95
5/15/2020 23:41	1614	923	77	100
5/15/2020 23:41	1407	845	78	102
5/15/2020 23:41	1539	845	78	102
5/15/2020 23:41	1475	844	78	102
5/15/2020 23:41	1418	811	79	102
5/15/2020 23:41	1469	825	80	103
5/15/2020 23:41	1449	827	80	104
5/15/2020 23:41	1437	827	80	104
5/15/2020 23:41	1571	824	80	104
5/15/2020 23:41	1462	827	80	104
5/15/2020 23:41	1456	825	81	104
5/15/2020 23:41	1446	814	81	104
5/15/2020 23:41	1427	827	81	104
5/15/2020 23:41	1429	840	82	104
5/15/2020 23:41	1431	831	82	104
5/15/2020 23:41	1568	839	81	105
5/15/2020 23:41	1445	832	82	105
5/15/2020 23:41	1441	841	81	104
5/15/2020 23:41	1432	840	81	104
5/15/2020 23:41	1465	804	82	103
5/15/2020 23:41	1459	801	81	104
5/15/2020 23:41	1431	800	81	104
5/15/2020 23:41	1442	800	82	104
5/15/2020 23:41	1388	800	82	105
5/15/2020 23:41	1400	819	82	104
5/15/2020 23:41	1428	800	82	104
5/15/2020 23:41	1417	800	82	104
5/15/2020 23:41	1434	800	82	104
5/15/2020 23:41	1472	801	82	106
5/15/2020 23:41	1517	857	82	104
5/15/2020 23:41	1518	844	82	104
5/15/2020 23:41	1454	800	81	105
5/15/2020 23:41	1584	918	83	104
5/15/2020 23:41	1526	895	82	105
5/15/2020 23:41	1378	882	82	104
5/15/2020 23:41	1398	807	81	105
5/15/2020 23:41	1404	800	81	105
5/15/2020 23:41	1433	804	82	104
5/15/2020 23:41	1498	830	82	104
5/15/2020 23:41	1444	834	82	106
5/15/2020 23:41	1526	837	82	104
5/15/2020 23:41	1522	844	82	105
5/15/2020 23:41	1443	811	82	106
5/15/2020 23:41	1416	830	82	104
5/15/2020 23:41	1463	800	82	105
5/15/2020 23:41	1427	815	82	104
5/15/2020 23:41	1430	802	81	103
5/15/2020 23:41	1555	805	82	107
5/15/2020 23:41	1417	846	82	104
5/15/2020 23:41	1443	800	81	104
5/15/2020 23:41	1439	800	82	105
5/15/2020 23:41	1438	804	82	104
5/15/2020 23:41	1465	802	83	104
5/15/2020 23:41	1444	800	82	104
5/15/2020 23:41	1392	800	82	105
5/15/2020 23:41	1465	812	82	107
5/15/2020 23:41	1453	839	82	105
5/15/2020 23:41	1439	848	82	104
5/15/2020 23:42	1439	800	82	105
5/15/2020 23:42	1420	809	82	104
5/15/2020 23:42	1430	801	82	104
5/15/2020 23:42	1442	799	82	104
5/15/2020 23:42	1431	800	82	104
5/15/2020 23:42	1419	800	82	106
5/15/2020 23:42	1431	840	82	104
5/15/2020 23:42	1438	800	82	105
5/15/2020 23:42	1540	828	82	105
5/15/2020 23:42	1395	809	83	104
5/15/2020 23:42	1451	801	82	106
5/15/2020 23:42	1451	853	82	104
5/15/2020 23:42	1463	800	81	103
5/15/2020 23:42	1464	856	82	104
5/15/2020 23:42	1460	801	82	104
5/15/2020 23:42	1442	800	81	105
5/15/2020 23:42	1398	825	82	104
5/15/2020 23:42	1444	800	82	104
5/15/2020 23:42	1436	800	81	105
5/15/2020 23:42	1423	809	82	104
5/15/2020 23:42	1442	800	81	106
5/15/2020 23:42	1408	836	82	104
5/15/2020 23:42	1397	802	82	104
5/15/2020 23:42	1441	800	82	104
5/15/2020 23:42	1449	800	82	104
5/15/2020 23:42	1441	800	81	104
5/15/2020 23:42	1440	825	81	106
5/15/2020 23:42	1423	800	82	105
5/15/2020 23:42	1483	817	82	105
5/15/2020 23:42	1425	800	81	105
5/15/2020 23:42	1506	800	83	104
5/15/2020 23:42	1444	800	82	104
5/15/2020 23:42	1450	800	82	105
5/15/2020 23:42	1422	845	82	105
5/15/2020 23:42	1438	803	82	102
5/15/2020 23:42	26	721	81	98
5/15/2020 23:42	1548	923	80	96
5/15/2020 23:42	1524	909	80	97
5/15/2020 23:42	1612	922	80	98
5/15/2020 23:42	1595	928	80	96
5/15/2020 23:42	1547	911	79	97
5/15/2020 23:42	1599	922	79	97

As you can see, temps got pretty high and the clock speeds were kept down. To be fair to your point, I did not experience any crashes while playing this.
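
To put a number on "the clock speeds were kept down", here is a minimal Python sketch that parses a handful of rows copied from the log above and compares the average core clock below and above a hot spot threshold. The 100C threshold, the 100 MHz idle cutoff (used to drop the 26 MHz readings), and the function names are assumptions made for illustration, not anything GPU-Z defines.

```python
from statistics import mean

# A few rows copied from the log above: timestamp, GPU clock, memory clock,
# GPU temperature, hot spot temperature (tab-separated).
LOG = """\
5/15/2020 23:40\t1613\t920\t76\t93
5/15/2020 23:40\t1598\t919\t76\t93
5/15/2020 23:40\t1431\t871\t78\t101
5/15/2020 23:40\t26\t836\t78\t96
5/15/2020 23:41\t1449\t827\t80\t104
5/15/2020 23:41\t1437\t827\t80\t104
"""


def parse_rows(log_text):
    """Turn tab-separated log lines into (gpu_clock, hot_spot) tuples."""
    rows = []
    for line in log_text.splitlines():
        if not line.strip():
            continue
        _stamp, clock, _mem, _edge, hot = line.split("\t")
        rows.append((int(clock), int(hot)))
    return rows


def mean_clock_by_hot_spot(rows, threshold_c=100, idle_mhz=100):
    """Average clock below and at/above threshold_c, skipping idle samples."""
    active = [(c, h) for c, h in rows if c >= idle_mhz]
    cool = [c for c, h in active if h < threshold_c]
    hot = [c for c, h in active if h >= threshold_c]
    return mean(cool), mean(hot)


cool_mean, hot_mean = mean_clock_by_hot_spot(parse_rows(LOG))
print(cool_mean, hot_mean)
```

On these six rows the average drops from about 1605 MHz to 1439 MHz once the hot spot passes 100C. Running the same function over the full log would give a better picture than this small sample.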

Seems like a buggy algorithm causing clock spikes would be a BIOS issue and easily fixed. That said, being part of a 0.2% - 0.5% market share has its downsides.
 
So... I found out by pure accident that the strange switch on the side of my card is a BIOS switch. There's very little official information about it... Primary mode is full power, Secondary mode is reduced.

In an attempt to solve my crashing issues and save $280, I decided to try switching the switch. PC crashed, I powered it down, took case apart, switched the switch, powered the PC on. Three long beeps and no video. Powered PC back off, switched back the switch, and it boots up fine.

Is there anything special I need to do to try out Secondary mode, or is my card more dorked up than I previously thought?
 
You done fugged up poking the bios switch with the rig on. You're only supposed to do that with the computer off. Might have corrupted the bios when switching it with the pc on.

That's just me adding a bit of common sense to the mix, so I may be wrong, but my thought process usually checks out on things like this.

If it's corrupted there may be a way to flash the bios so it works again on the proper setting.
 
Please tell me you didn't flip the BIOS switch with the PC running.

My God what would possess you to think that's a good idea? :lol:
 