Bug 111080

Summary: Random crash on amdgpu due to temperature misreporting
Product: DRI Reporter: Michel <timitch_1>
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED QA Contact:
Severity: major    
Priority: medium    
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
amdgpu_pm_info information from start of game to crash none

Description Michel 2019-07-07 15:53:57 UTC
Created attachment 144716 [details]
amdgpu_pm_info information from start of game to crash

Hi, 

I have been experiencing random crashes in Dota 2 for the past two years.
I have changed everything in the computer: 6900K -> Threadripper, Corsair memory -> G.Skill, Radeon Frontier -> Radeon VII. Ubuntu 16.04 -> 16.10 -> 17.04 -> 17.10 -> 18.04 -> 18.10 -> 19.04. This was with all the Mesa versions in between; I am currently on
"OpenGL renderer string: AMD Radeon VII (VEGA20, DRM 3.32.0, 5.2.0-rc7+, LLVM 9.0.0)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 19.2.0-devel - padoka PPA
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
"
All of these configurations show the same random crash.

I finally got a lead on the problem when I saw the GPU reporting unrealistic values, e.g. the clock jumping into the 10,000 MHz range. Around the time of the crash, the temperature in the logs goes from 62 C to 500 C and back to 62 C within two seconds. I suspect this causes the GPU to trigger its protection and freeze; if the reading were real, it would also violate some laws of physics.

Most other tools I use to test the graphics card, for example Unigine, report correct values within the supported ranges defined for the card, which are:

"
#OD_VDDC_CURVE:
#0:        808Mhz        704mV
#1:       1304Mhz        777mV
#2:       1801Mhz       1054mV
#OD_RANGE:
#SCLK:     808Mhz       2200Mhz
#MCLK:     351Mhz       1200Mhz
"
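To make the comparison against these limits concrete, here is a minimal Python sketch that parses the OD_RANGE section quoted above and checks whether a reported clock falls inside it. The function names parse_od_range and is_within are hypothetical helpers for illustration, not part of any amdgpu tool, and the parser assumes the text looks exactly like the excerpt in this report (including the leading "#").

```python
import re

# Text copied from the report above (the leading '#' is kept as quoted).
OD_TEXT = """\
#OD_VDDC_CURVE:
#0:        808Mhz        704mV
#1:       1304Mhz        777mV
#2:       1801Mhz       1054mV
#OD_RANGE:
#SCLK:     808Mhz       2200Mhz
#MCLK:     351Mhz       1200Mhz
"""

def parse_od_range(text):
    """Return {'SCLK': (min_mhz, max_mhz), ...} from the OD_RANGE section."""
    ranges = {}
    in_od_range = False
    for raw in text.splitlines():
        line = raw.lstrip("#").strip()
        # A section header is a single token ending in ':' (e.g. 'OD_RANGE:').
        if line.endswith(":") and " " not in line:
            in_od_range = (line == "OD_RANGE:")
            continue
        if not in_od_range:
            continue
        m = re.match(r"(\w+):\s+(\d+)Mhz\s+(\d+)Mhz$", line, re.IGNORECASE)
        if m:
            ranges[m.group(1)] = (int(m.group(2)), int(m.group(3)))
    return ranges

def is_within(ranges, clock, mhz):
    """True if mhz lies inside the advertised range for the given clock."""
    low, high = ranges[clock]
    return low <= mhz <= high

ranges = parse_od_range(OD_TEXT)
print(ranges)
print(is_within(ranges, "SCLK", 1924))
print(is_within(ranges, "SCLK", 5422))
```

With these limits, an SCLK reading such as 5422 MHz is well outside what the card advertises, which is why such values look like misreporting rather than a real clock.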

Attached is an example generated with 
"watch -t -n1 'cat /sys/kernel/debug/dri/1/amdgpu_pm_info|grep -A 9 "GFX Clocks" | tee -a /home/mitch/tmp/gpulog.txt'"

Example output of grep Temp:
"
GPU Temperature: 70 C
GPU Temperature: 511 C
GPU Temperature: 69 C
"

Example output of grep \(SCLK:
"
	1924 MHz (SCLK)
	5422 MHz (SCLK)
	1999 MHz (SCLK)
"
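For anyone who wants to scan a similar log automatically, the following Python sketch flags the kind of readings shown above. It is only an illustration: find_glitches is a hypothetical helper, the 50 C per-sample step limit is an assumed threshold, and the 2200 MHz ceiling is the SCLK maximum from the OD_RANGE quoted earlier. The line formats are assumed to match the grep output in this report.

```python
import re

TEMP_RE = re.compile(r"GPU Temperature:\s*(\d+)\s*C")
SCLK_RE = re.compile(r"(\d+)\s*MHz\s*\(SCLK\)")

# Sample lines copied from the grep output in this report.
SAMPLE = [
    "GPU Temperature: 70 C",
    "GPU Temperature: 511 C",
    "GPU Temperature: 69 C",
    "\t1924 MHz (SCLK)",
    "\t5422 MHz (SCLK)",
    "\t1999 MHz (SCLK)",
]

def find_glitches(lines, sclk_max_mhz=2200, max_temp_step_c=50):
    """Return (line_number, kind, value) for implausible readings.

    A temperature is flagged when it differs from the last accepted
    temperature by more than max_temp_step_c (assumed threshold).
    An SCLK value is flagged when it exceeds sclk_max_mhz.
    """
    glitches = []
    last_temp = None
    for n, line in enumerate(lines, start=1):
        m = TEMP_RE.search(line)
        if m:
            temp = int(m.group(1))
            if last_temp is not None and abs(temp - last_temp) > max_temp_step_c:
                glitches.append((n, "temperature", temp))
            else:
                # Only accepted samples become the new baseline, so the
                # return to a normal value is not flagged as a second jump.
                last_temp = temp
            continue
        m = SCLK_RE.search(line)
        if m and int(m.group(1)) > sclk_max_mhz:
            glitches.append((n, "sclk", int(m.group(1))))
    return glitches

for entry in find_glitches(SAMPLE):
    print(entry)
```

Note that the first temperature sample is always taken as the baseline, so a log that starts with a bad reading would need a different starting point. The log written by the watch command could be fed in with something like find_glitches(open("gpulog.txt")).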



I realize the issue might be somewhere other than the Mesa driver, but I would like to know where it could be and whether anybody else has seen this kind of behaviour.

Thank you very much for any help
Comment 1 GitLab Migration User 2019-09-18 20:29:16 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1044.
