Bug 107325

Summary:

Reported temperature of nvidia card with nouveau driver is wrong

Product:

xorg

Reporter:

Jirka Novak <j.novak>

Component:

Driver/nouveau

Assignee:

Nouveau Project <nouveau>

Status:

RESOLVED MOVED

QA Contact:

Xorg Project Team <xorg-team>

Severity:

normal

Priority:

medium

CC:

florczak.raf+freedesktop, pachoramos1, rhyskidd

Version:

unspecified

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Attachments:

Description	Flags
vbios.rom	none

Description Jirka Novak 2018-07-21 20:36:19 UTC

Hello,

  I use a Dell Precision 3530 with an NVIDIA Corporation GP107GLM [Quadro P600 Mobile] (rev a1), running Fedora 28 with the 4.17.6 x86_64 kernel.
  I found that the sensors tool shows the wrong temperature:

$ sensors
nouveau-pci-0100
Adapter: PCI adapter
temp1:       +511.0°C  (high = +95.0°C, hyst =  +3.0°C)
                       (crit = +105.0°C, hyst =  +5.0°C)
                       (emerg = +135.0°C, hyst =  +5.0°C)

  The temperature is obviously wrong.
  I tried to troubleshoot it on the sensors side, and it looks like the sensors tool receives this wrong value from the driver.
  I made one more observation: right after suspend/wakeup the value is completely different:

$ sensors
nouveau-pci-0100
Adapter: PCI adapter
temp1:       +511.0°C  (high = +95.0°C, hyst =  +3.0°C)
                       (crit = +105.0°C, hyst =  +5.0°C)
                       (emerg = +135.0°C, hyst =  +5.0°C)

$ sensors
nouveau-pci-0100
Adapter: PCI adapter
temp1:        +43.0°C  (high = +95.0°C, hyst =  +3.0°C)
                       (crit = +105.0°C, hyst =  +5.0°C)
                       (emerg = +135.0°C, hyst =  +5.0°C)

$ sensors
nouveau-pci-0100
Adapter: PCI adapter
temp1:       +511.0°C  (high = +95.0°C, hyst =  +3.0°C)
                       (crit = +105.0°C, hyst =  +5.0°C)
                       (emerg = +135.0°C, hyst =  +5.0°C)

  I can provide more information if required.

Comment 1 Rhys Kidd 2018-07-22 15:08:10 UTC

Thanks for the bug report Jirka,

It would be helpful if you could download and build a nouveau-related debug toolkit, envytools [0], and run the following commands (inside the nva/ subfolder):

$ ./nvapeek 0x020460

$ ./nvapeek 0x020400

It would also be helpful to see a copy of your GPU's VBIOS attached here. This can be produced by running the below command:

$ cat /sys/kernel/debug/dri/0/vbios.rom > vbios.rom


[0] https://github.com/envytools/envytools

Comment 2 Ilia Mirkin 2018-07-22 15:46:19 UTC

Perhaps when it's runtime-suspended, the readings return all 1's, and we report 511 (0x1ff). Or some variation thereof.

Jirka - if you boot with nouveau.runpm=0, I suspect the temperature will be fine -- good to check though.
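
One way to test this hypothesis from user space, without rebooting, is to check the GPU's runtime power state before trusting the hwmon value. The sketch below is only an illustration: it assumes the card sits at PCI address 0000:01:00.0 (inferred from `nouveau-pci-0100`), that the hwmon directory is supplied by the caller (its index varies between systems), and `read_gpu_temp` is a hypothetical helper name.

```python
from pathlib import Path

def read_gpu_temp(pci_dev, hwmon_dir):
    """Return the temperature in degrees Celsius, or None if the GPU is not active."""
    status = (Path(pci_dev) / "power" / "runtime_status").read_text().strip()
    if status != "active":
        # The reading is not trustworthy while the device is runtime-suspended.
        return None
    # hwmon reports temp1_input in millidegrees Celsius.
    millideg = int((Path(hwmon_dir) / "temp1_input").read_text())
    return millideg / 1000.0

if __name__ == "__main__":
    # Assumed paths; adjust both to the local system.
    dev = "/sys/bus/pci/devices/0000:01:00.0"
    hwmon = "/sys/class/hwmon/hwmon0"
    try:
        print(read_gpu_temp(dev, hwmon))
    except (OSError, ValueError) as exc:
        print(f"could not read sensor: {exc}")
```

If the helper returns None whenever sensors shows +511.0°C, that would support the runtime-suspend explanation.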

Comment 3 Jirka Novak 2018-07-22 19:12:19 UTC

Created attachment 140773
vbios.rom

Hi,

> It would be helpful if you could download and build a nouveau-related debug
> toolkit, envytools [0], and run the following commands (inside the nva/
> subfolder):
> 
> $ ./nvapeek 0x020460
> 
> $ ./nvapeek 0x020400

Here is the output. Note that it differs between subsequent calls:

# nvapeek 0x020460
00020460: 20003170
# nvapeek 0x020460
00020460: 20003180
# nvapeek 0x020460
00020460: 200031a8
# nvapeek 0x020460
00020460: 200031a0
# nvapeek 0x020460
00020460: 200031e8

# nvapeek 0x020400
00020400: 00000030
# nvapeek 0x020400
00020400: 00000031
# nvapeek 0x020400
00020400: 00000031
# nvapeek 0x020400
00020400: 00000031
# nvapeek 0x020400
00020400: 00000031
# nvapeek 0x020400
00020400: 00000032

> It would also be helpful to see a copy of your GPU's VBIOS attached here. This
> can be produced by running the below command:
> 
> $ cat /sys/kernel/debug/dri/0/vbios.rom > vbios.rom

File is attached.

					Best regards,

							Jirka Novak
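
As a cross-check on the nvapeek readings in this comment, the sketch below decodes the raw 0x020460 values. The bit layout is an assumption inferred from these numbers and from the 0x1ff remark in Comment 2 (whole degrees in bits 8 to 16, a validity flag in bit 29); the names `decode_tsensor`, `TEMP_MASK`, and `VALID_BIT` are hypothetical and not taken from the driver.

```python
# Assumed layout of the value read from 0x020460 (not confirmed by this report):
TEMP_MASK = 0x0001FFF8   # temperature field; bits 8..16 hold whole degrees Celsius
VALID_BIT = 0x20000000   # assumed to be set when the sensor reading is valid

def decode_tsensor(raw):
    """Return whole degrees Celsius, or None if the assumed valid bit is clear."""
    if not raw & VALID_BIT:
        return None
    return (raw & TEMP_MASK) >> 8

# The five samples posted in this comment.
readings = [0x20003170, 0x20003180, 0x200031A8, 0x200031A0, 0x200031E8]
for raw in readings:
    print(f"{raw:#010x} -> {decode_tsensor(raw)} C")
```

Under that assumption every sample decodes to 49 °C, which is consistent with the 0x30 to 0x32 (48 to 50) values read from 0x020400.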

Comment 4 Jirka Novak 2018-07-22 19:14:07 UTC

Hi,

> Perhaps when it's runtime-suspended, the readings return all 1's, and we report
> 511 (0x1ff). Or some variation thereof.
> 
> Jirka - if you boot with nouveau.runpm=0, I suspect the temperature will be
> fine -- good to check though.

Yes, you are correct. With nouveau.runpm=0 it returns 49-50 degrees.

						Best regards,

							Jirka Novak

Comment 5 Pacho Ramos 2019-03-31 11:09:19 UTC

I still have the same issue with kernel 4.19.30.

Comment 6 Karol Herbst 2019-04-10 14:31:02 UTC

This is a runtime suspend issue. While the GPU is suspended, the temperature reading fails, but we don't actually check for that, so we return the error value masked to 9 bits (-1 & 0x1ff = 511).

I think I had a patch for that somewhere, let me see.
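
The arithmetic in this comment can be sketched as follows. The sketch models a failed read as an all-ones 32-bit value (the guess from Comment 2) and shows one possible guard. The function names and the choice to treat all ones as an error are illustrative assumptions, not the actual patch.

```python
ALL_ONES = 0xFFFFFFFF  # what a read from a powered-down device is assumed to return

def read_temp_unchecked(raw):
    # Mask to 9 bits without checking for failure: all ones becomes 0x1ff.
    return raw & 0x1FF

def read_temp_checked(raw):
    # Hypothetical guard: reject the failure pattern instead of masking it.
    if raw == ALL_ONES:
        raise OSError("temperature sensor unavailable (GPU suspended?)")
    return raw & 0x1FF

print(read_temp_unchecked(ALL_ONES))  # 511, the bogus value shown by sensors
print(-1 & 0x1FF)                     # 511 as well: -1 is all ones in two's complement
print(read_temp_checked(0x31))        # 49
```

With a guard like this, user space would see a read error instead of +511.0°C while the GPU is suspended.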

Comment 7 Roy 2019-04-10 14:34:26 UTC

Will this bug interact with Lyude's recent patch, "drm/nouveau/i2c: Disable i2c bus access after ->fini()"?

Comment 8 Karol Herbst 2019-04-11 00:46:45 UTC

No. That patch was mainly for displays, AFAIK, and we read out the temperature through MMIO.

Comment 9 Martin Peres 2019-12-04 09:44:03 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/xorg/driver/xf86-video-nouveau/issues/445.
