Bug 111080 - Random crash on amdgpu due to temperature misreporting
Summary: Random crash on amdgpu due to temperature misreporting
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu (show other bugs)
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-07-07 15:53 UTC by Michel
Modified: 2019-09-19 03:24 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
amdgpu_pm_info information from start of game to crash (378.32 KB, text/plain)
2019-07-07 15:53 UTC, Michel

Description Michel 2019-07-07 15:53:57 UTC
Created attachment 144716 [details]
amdgpu_pm_info information from start of game to crash

Hi, 

I have been experiencing random crashes in Dota 2 for the past two years.
I have changed everything in the computer: 6900K -> Threadripper, Corsair memory -> G.Skill, Radeon Frontier -> Radeon VII, and Ubuntu 16.04 -> 16.10 -> 17.04 -> 17.10 -> 18.04 -> 18.10 -> 19.04. This is with all the Mesa versions in between; I am currently on
"OpenGL renderer string: AMD Radeon VII (VEGA20, DRM 3.32.0, 5.2.0-rc7+, LLVM 9.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.0-devel - padoka PPA
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
"
All of these setups experience the same random crash.

I finally got a lead on the problem when I saw the GPU reporting unrealistic values, e.g. the clock jumping into the 10,000 MHz range. Around the time of the crash, the temperature in the logs goes from 62 C to 500 C and back to 62 C within two seconds. I suspect this causes the GPU to apply its protection and freeze; if the reading were true, it would also violate some laws of physics.

Most other tools I use to test the graphics card, for example Unigine, report correct values within the supported ranges defined for the card, which are:

"
#OD_VDDC_CURVE:
#0:        808Mhz        704mV
#1:       1304Mhz        777mV
#2:       1801Mhz       1054mV
#OD_RANGE:
#SCLK:     808Mhz       2200Mhz
#MCLK:     351Mhz       1200Mhz
"

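As a rough sketch of how the ranges above could be used to sanity-check logged clocks, the following Python parses the OD_RANGE section from the text exactly as posted (including the leading "#"). The function names parse_od_range and clock_is_plausible are made up for this illustration and are not part of any amdgpu tooling.

```python
import re

# Text copied from the report above, leading '#' included as posted.
OD_TEXT = """\
#OD_VDDC_CURVE:
#0:        808Mhz        704mV
#1:       1304Mhz        777mV
#2:       1801Mhz       1054mV
#OD_RANGE:
#SCLK:     808Mhz       2200Mhz
#MCLK:     351Mhz       1200Mhz
"""


def parse_od_range(text):
    """Return {clock_name: (min_mhz, max_mhz)} from the OD_RANGE section."""
    ranges = {}
    inside = False
    for raw in text.splitlines():
        line = raw.lstrip("#").strip()
        # A section header is a single token ending in ':' (e.g. OD_RANGE:).
        if line.endswith(":") and " " not in line:
            inside = (line == "OD_RANGE:")
            continue
        if inside:
            m = re.match(r"(\w+):\s+(\d+)Mhz\s+(\d+)Mhz$", line)
            if m:
                ranges[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return ranges


def clock_is_plausible(ranges, name, mhz):
    """True if mhz lies inside the advertised range for the named clock."""
    low, high = ranges[name]
    return low <= mhz <= high


ranges = parse_od_range(OD_TEXT)
print(ranges)
print(clock_is_plausible(ranges, "SCLK", 5422))
```

With these ranges, a logged SCLK of 5422 MHz falls well outside the 808 to 2200 MHz window, which is what makes the reading suspicious rather than a real clock.
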
Attached is an example generated with 
"watch -t -n1 'cat /sys/kernel/debug/dri/1/amdgpu_pm_info|grep -A 9 "GFX Clocks" | tee -a /home/mitch/tmp/gpulog.txt'"

Example grep Temp
"
GPU Temperature: 70 C
GPU Temperature: 511 C
GPU Temperature: 69 C
"

grep \(SCLK
"
	1924 MHz (SCLK)
	5422 MHz (SCLK)
	1999 MHz (SCLK)
"

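Instead of grepping by hand, the log could be scanned for these glitches automatically. The sketch below is only an illustration: it assumes the log contains lines shaped like the two excerpts above, uses the 2200 MHz SCLK upper bound from OD_RANGE, and picks an arbitrary 50 C limit for the change between consecutive one-second samples. The name find_glitches, that 50 C threshold, and the interleaving of the sample lines are assumptions, not something taken from the driver or from the attached log.

```python
import re

TEMP_RE = re.compile(r"GPU Temperature:\s*(\d+)\s*C")
SCLK_RE = re.compile(r"(\d+)\s*MHz\s*\(SCLK\)")

SCLK_MAX_MHZ = 2200    # upper SCLK bound from OD_RANGE in this report
MAX_TEMP_STEP_C = 50   # assumed: largest believable change between samples


def find_glitches(lines):
    """Return (line_number, kind, value) for every implausible sample.

    The first temperature seen is trusted as the baseline; a flagged
    reading does not replace the baseline, so the return to a normal
    value afterwards is not reported as a second glitch.
    """
    glitches = []
    prev_temp = None
    for n, line in enumerate(lines, start=1):
        m = TEMP_RE.search(line)
        if m:
            temp = int(m.group(1))
            if prev_temp is not None and abs(temp - prev_temp) > MAX_TEMP_STEP_C:
                glitches.append((n, "temp", temp))
            else:
                prev_temp = temp
            continue
        m = SCLK_RE.search(line)
        if m and int(m.group(1)) > SCLK_MAX_MHZ:
            glitches.append((n, "sclk", int(m.group(1))))
    return glitches


# Sample lines taken from the two grep excerpts above (order assumed).
sample = [
    "GPU Temperature: 70 C",
    "\t1924 MHz (SCLK)",
    "GPU Temperature: 511 C",
    "\t5422 MHz (SCLK)",
    "GPU Temperature: 69 C",
    "\t1999 MHz (SCLK)",
]
for glitch in find_glitches(sample):
    print(glitch)
```

To run it against a real capture, pass an open file object for the log written by the watch command in place of the sample list.
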


I realize the issue might be somewhere other than the Mesa driver, but I would like to know where it could be and whether anybody else has seen this kind of behaviour.

Thank you very much for any help
Comment 1 GitLab Migration User 2019-09-18 20:29:16 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1044.

