Bug 106548 - Failed GfxDrv_DriverAcceptanceQuery.GL_GPU_FREQ_OVERRIDE_MDAPI
Status: RESOLVED MOVED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i915
Version: unspecified
Hardware: Other All
Importance: high normal
Assignee: Default DRI bug account
QA Contact: Default DRI bug account
Duplicates: 106549
Reported: 2018-05-17 08:27 UTC by Kishore
Modified: 2019-09-18 19:41 UTC

Attachments
GL_GPU_FREQ_OVERRIDE_MDAPI Acceptance failure (648.28 KB, text/plain)
2018-05-17 08:27 UTC, Kishore
DMESG output (60.27 KB, text/plain)
2018-05-24 18:11 UTC, Francesco Balestrieri

Description Kishore 2018-05-17 08:27:58 UTC
Created attachment 139610
GL_GPU_FREQ_OVERRIDE_MDAPI Acceptance failure

On Ubuntu 18.04, with the latest Mesa and MDAPI from here:
https://gerrit-gfx.intel.com/#/admin/projects/gfx/core/metrics-discovery

The acceptance test is failing for GL_GPU_FREQ_OVERRIDE_MDAPI.
Comment 1 Chris Wilson 2018-05-17 10:12:45 UTC
*** Bug 106549 has been marked as a duplicate of this bug. ***
Comment 2 Chris Wilson 2018-05-17 10:13:04 UTC
/home/kk/workspace/gpa_extensions/src/GfxDrvDriverAcceptanceTest/test_GfxDrv_DriverAcceptanceQuery.cpp:1438: Failure
Value of: pOverride->SetOverride(&freqParam, sizeof(freqParam))
  Actual: 43
Expected: CC_OK
Which is: 0

Looks like the test case needs improving for starters.
Comment 3 Vyacheslav Bogdanov 2018-05-17 10:35:41 UTC
(In reply to Chris Wilson from comment #2)
> /home/kk/workspace/gpa_extensions/src/GfxDrvDriverAcceptanceTest/
> test_GfxDrv_DriverAcceptanceQuery.cpp:1438: Failure
> Value of: pOverride->SetOverride(&freqParam, sizeof(freqParam))
>   Actual: 43
> Expected: CC_OK
> Which is: 0
> 
> Looks like the test case needs improving for starters.

Could you please explain? The function call fails, and the test checks for that. Syslog typically provides more logs that will help in understanding the cause of the failure.
Comment 4 Joonas Lahtinen 2018-05-17 14:39:27 UTC
What exactly is being reported here? If there's a regression after a kernel update, please provide a bisect between last known good and drm-tip.

There's nothing MDAPI related in the kernel and you're not specifying any link to the testing suite, so this bug report is hardly useful.

Providing a dmesg with debug options from running just the failing subtest would be a good start to guess something.

For proper bug reporting, please see:

https://01.org/linuxgraphics/documentation/how-report-bugs
Comment 5 Francesco Balestrieri 2018-05-19 05:20:16 UTC
Additional comment from Kishore in the duplicate bug #106549:

"I have tried the default kernel on Ubuntu 18.04, which is 4.15-rc20,
and also drm-tip; I can see the failure on both."
Comment 6 Francesco Balestrieri 2018-05-21 08:50:41 UTC
Kishore, could you point to an easy way to reproduce the failure without running the full test suite, or at least say which MDAPI call is made when this happens, along with its source code? Also, please include dmesg output showing the kernel failure (make sure to enable debug information with drm.debug=0x1e log_buf_len=4M).
Comment 7 Francesco Balestrieri 2018-05-24 18:11:04 UTC
Created attachment 139741
DMESG output

Kernel logs
Comment 8 Francesco Balestrieri 2018-05-25 07:50:09 UTC
I attached the log I got from Kishore. I wasn't able to spot any kernel failure though.
Comment 9 Chris Wilson 2018-05-25 07:55:35 UTC
There's no user API that corresponds to a naive interpretation of "GL_GPU_FREQ_OVERRIDE_MDAPI" (I assume that means to set the gpu frequency). There is a root-only interface to set the global frequency limits and a proposed CONTEXT_SETPARAM to set per-context frequency requests, but that is pending review and has no userspace.

Hence, my request for the test case to be improved to report what it actually tried and what actually happened, because at the moment this bug report is not actionable.
Comment 10 Francesco Balestrieri 2018-05-25 13:46:42 UTC
Indeed, after some more investigation I found out that the test case uses (via MDAPI) the root-only interface:

/sys/devices/pci0000:00/0000:00:02.0/drm/card0/gt_boost_freq_mhz
/sys/devices/pci0000:00/0000:00:02.0/drm/card0/gt_max_freq_mhz
/sys/devices/pci0000:00/0000:00:02.0/drm/card0/gt_min_freq_mhz 

to ask the kernel to maintain a static GPU frequency, and then reads it back to check that it stays within a range of the requested value. The test fails because the value it reads back is not what it requested (reproduced with drm-tip). I still don't have dmesg output with debug logs enabled but I'll try to get it next week.
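
The request described above can be approximated outside the test suite. The following is a minimal Python sketch, not the MDAPI implementation: the helper names (`read_mhz`, `write_mhz`, `pin_frequency`, `read_back`) are hypothetical, and the write ordering assumes the kernel rejects a write that would leave the minimum above the maximum. It only reads back the requested limits, not the measured average frequency that the test reports.

```python
from pathlib import Path

# Location quoted in this comment; other machines may use a different PCI path.
CARD_DIR = Path("/sys/devices/pci0000:00/0000:00:02.0/drm/card0")


def read_mhz(card_dir, name):
    """Read one integer MHz value from a sysfs-style file."""
    return int((Path(card_dir) / name).read_text().strip())


def write_mhz(card_dir, name, mhz):
    """Write one integer MHz value; on real sysfs this needs root."""
    (Path(card_dir) / name).write_text(f"{mhz}\n")


def pin_frequency(card_dir, mhz):
    """Request a static frequency by setting min, max and boost to one value.

    Assumption: a write that would leave min above max is rejected, so max
    is raised before min when moving up, and min is written first otherwise.
    """
    if mhz > read_mhz(card_dir, "gt_max_freq_mhz"):
        order = ("gt_max_freq_mhz", "gt_min_freq_mhz")
    else:
        order = ("gt_min_freq_mhz", "gt_max_freq_mhz")
    for name in order:
        write_mhz(card_dir, name, mhz)
    write_mhz(card_dir, "gt_boost_freq_mhz", mhz)


def read_back(card_dir):
    """Return the three limits as currently reported."""
    names = ("gt_min_freq_mhz", "gt_max_freq_mhz", "gt_boost_freq_mhz")
    return {name: read_mhz(card_dir, name) for name in names}
```

Run as root, `pin_frequency(CARD_DIR, 950)` followed by `read_back(CARD_DIR)` would show whether the kernel accepted the limits. That is a separate question from whether the hardware then runs at that frequency.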

For the record, this is what the test prints out once it is run as root:

[ RUN      ] GfxDrv_DriverAcceptanceQuery.GL_GPU_FREQ_OVERRIDE_MDAPI
AvgGpuCoreFrequencyMHz    =        947   ( expected in range [855, 950] )

AvgGpuCoreFrequencyMHz    =        948   ( expected in range [855, 950] )

AvgGpuCoreFrequencyMHz    =        948   ( expected in range [855, 950] )

AvgGpuCoreFrequencyMHz    =        947   ( expected in range [855, 950] )

AvgGpuCoreFrequencyMHz    =        948   ( expected in range [855, 950] )

AvgGpuCoreFrequencyMHz    =        947   ( expected in range [855, 950] )

AvgGpuCoreFrequencyMHz    =        948   ( expected in range [855, 950] )

AvgGpuCoreFrequencyMHz    =        947   ( expected in range [855, 950] )

AvgGpuCoreFrequencyMHz    =        947   ( expected in range [855, 950] )

AvgGpuCoreFrequencyMHz    =        797   ( expected in range [855, 950] ) => OUT OF RANGE!
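
To pick the failing samples out of a longer log, the output above can be checked mechanically. This is a small Python sketch that assumes every sample line has exactly the format shown above; the function name `out_of_range_samples` is hypothetical and not part of the test suite.

```python
import re

# Matches lines such as:
# AvgGpuCoreFrequencyMHz    =        947   ( expected in range [855, 950] )
LINE_RE = re.compile(
    r"AvgGpuCoreFrequencyMHz\s*=\s*(\d+)\s*"
    r"\(\s*expected in range \[(\d+),\s*(\d+)\]\s*\)"
)


def out_of_range_samples(log_text):
    """Return (index, value, low, high) for each sample outside its range."""
    failures = []
    for index, match in enumerate(LINE_RE.finditer(log_text)):
        value, low, high = (int(group) for group in match.groups())
        if not low <= value <= high:
            failures.append((index, value, low, high))
    return failures
```

Applied to the ten samples above, only the last one (797 MHz, index 9) falls outside [855, 950].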
Comment 11 Chris Wilson 2018-05-25 14:05:10 UTC
An idle gpu is always reported as running at the min(idle) freq. You have to be very careful in the workload you construct that it does indeed keep the gpu busy if you want to use a sampling method.

No sign of a bug yet...
Comment 12 Francesco Balestrieri 2018-05-25 15:30:52 UTC
(In reply to Chris Wilson from comment #11)
> An idle gpu is always reported as running at the min(idle) freq. You have to
> be very careful in the workload you construct that it does indeed keep the
> gpu busy if you want to use a sampling method.
> 

Apparently the test case uses queries, not the sampled method of metrics measurement.

> No sign of a bug yet...

No disagreement there, however any further suggestion as to what to look at to figure out what is going wrong is welcome.
Comment 13 Chris Wilson 2018-05-25 15:49:59 UTC
Frequency changes are accompanied by a tracepoint (though note we only say what we ask the hw to do; the hw is at liberty to do whatever it wants), and you can enable the tracepoints for batch submission, so if you have the time and patience to correlate them with the test you can draw a pretty graph. For something along those lines, see intel-gpu-overlay.
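
Enabling those tracepoints could look roughly like the Python sketch below. It relies only on the generic tracefs layout (`events/<subsystem>/<event>/enable`). The mount point and the helper name `enable_events` are assumptions, and the event names are deliberately left to the caller because they differ between kernel versions.

```python
from pathlib import Path

# Common tracefs mount point (an assumption); older setups expose the same
# tree under /sys/kernel/debug/tracing.
TRACING_DIR = Path("/sys/kernel/tracing")


def enable_events(tracing_dir, events, subsystem="i915"):
    """Write "1" to events/<subsystem>/<event>/enable for each event name.

    Raises FileNotFoundError if an event does not exist, so a misspelled or
    renamed tracepoint is reported instead of silently ignored.
    """
    enabled = []
    for event in events:
        enable_file = Path(tracing_dir) / "events" / subsystem / event / "enable"
        if not enable_file.is_file():
            raise FileNotFoundError(f"no such tracepoint: {subsystem}/{event}")
        enable_file.write_text("1")
        enabled.append(enable_file)
    return enabled
```

The names to pass should be taken from the `events/i915/` directory on the running kernel. The frequency-change event is believed to be called `intel_gpu_freq_change`, but that, like the request submission event names, should be checked there rather than taken from this sketch.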
Comment 14 GitLab Migration User 2019-09-18 19:41:07 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/784.