Created attachment 139610 GL_GPU_FREQ_OVERRIDE_MDAPI Acceptance failure
On Ubuntu 18.04, with the latest Mesa and MDAPI from https://gerrit-gfx.intel.com/#/admin/projects/gfx/core/metrics-discovery, the acceptance test is failing for GL_GPU_FREQ_OVERRIDE_MDAPI.
*** Bug 106549 has been marked as a duplicate of this bug. ***
/home/kk/workspace/gpa_extensions/src/GfxDrvDriverAcceptanceTest/test_GfxDrv_DriverAcceptanceQuery.cpp:1438: Failure
Value of: pOverride->SetOverride(&freqParam, sizeof(freqParam))
Actual: 43
Expected: CC_OK
Which is: 0
Looks like the test case needs improving for starters.
(In reply to Chris Wilson from comment #2)
> /home/kk/workspace/gpa_extensions/src/GfxDrvDriverAcceptanceTest/
> test_GfxDrv_DriverAcceptanceQuery.cpp:1438: Failure
> Value of: pOverride->SetOverride(&freqParam, sizeof(freqParam))
> Actual: 43
> Expected: CC_OK
> Which is: 0
>
> Looks like the test case needs improving for starters.
Could you please explain? The function call fails, and the test checks for that. Syslog typically provides more logs that will help in understanding the cause of the failure.
What exactly is being reported here? If there's a regression after a kernel update, please provide a bisect between last known good and drm-tip. There's nothing MDAPI related in the kernel and you're not specifying any link to the testing suite, so this bug report is hardly useful. Providing a dmesg with debug options from running just the failing subtest would be a good start to guess something. For proper bug reporting, please see: https://01.org/linuxgraphics/documentation/how-report-bugs
Additional comment from Kishore in the duplicate bug #106549: "I have tried on default kernel version on ubuntu 18.04 is 4.15-rc20. and also on the drm-tip, i can see the failure"
Kishore, could you point to an easy way to reproduce the failure without running the full test suite, or at least tell us which MDAPI is called when this happens and point to its source code? Also, please include dmesg output showing the kernel failure (make sure to enable debug information with drm.debug=0x1e log_buf_len=4M).
Created attachment 139741 DMESG output
Kernel logs
I attached the log I got from Kishore. I wasn't able to spot any kernel failure though.
There's no user API that corresponds to a naive interpretation of "GL_GPU_FREQ_OVERRIDE_MDAPI" (I assume that means to set the gpu frequency). There is a root-only interface to set the global frequency limits and a proposed CONTEXT_SETPARAM to set per-context frequency requests, but that is pending review and has no userspace. Hence, my request for the test case to be improved to report what it actually tried and what actually happened, because at the moment this bug report is not actionable.
Indeed, after some more investigation I found out that the test case uses (via MDAPI) the root-only interface:
/sys/devices/pci0000:00/0000:00:02.0/drm/card0/gt_boost_freq_mhz
/sys/devices/pci0000:00/0000:00:02.0/drm/card0/gt_max_freq_mhz
/sys/devices/pci0000:00/0000:00:02.0/drm/card0/gt_min_freq_mhz
to ask the kernel to maintain a static GPU frequency, and then reads the frequency back to check that it stays within a range of the requested value. The test fails because the value it reads back is not what it requested (reproduced with drm-tip). I still don't have dmesg output with debug logs enabled, but I'll try to get it next week.
For the record, this is what the test prints out once it is run as root:
[ RUN ] GfxDrv_DriverAcceptanceQuery.GL_GPU_FREQ_OVERRIDE_MDAPI
AvgGpuCoreFrequencyMHz = 947 ( expected in range [855, 950] )
AvgGpuCoreFrequencyMHz = 948 ( expected in range [855, 950] )
AvgGpuCoreFrequencyMHz = 948 ( expected in range [855, 950] )
AvgGpuCoreFrequencyMHz = 947 ( expected in range [855, 950] )
AvgGpuCoreFrequencyMHz = 948 ( expected in range [855, 950] )
AvgGpuCoreFrequencyMHz = 947 ( expected in range [855, 950] )
AvgGpuCoreFrequencyMHz = 948 ( expected in range [855, 950] )
AvgGpuCoreFrequencyMHz = 947 ( expected in range [855, 950] )
AvgGpuCoreFrequencyMHz = 947 ( expected in range [855, 950] )
AvgGpuCoreFrequencyMHz = 797 ( expected in range [855, 950] ) => OUT OF RANGE!
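For readers who want to poke at the same interface without MDAPI, the mechanism described in this comment can be sketched roughly as below. This is not the MDAPI or test-suite code. The function names pin_frequency and out_of_range are hypothetical, and the write ordering (keeping min <= max at every step) is an assumption about what the kernel accepts rather than something confirmed in this thread. The three file names are the ones listed above; the sample values and the [855, 950] range in the usage note are taken from the log above.

```python
from pathlib import Path

# File names as listed in the comment above (i915 sysfs, root-only for writes).
FREQ_FILES = ("gt_min_freq_mhz", "gt_max_freq_mhz", "gt_boost_freq_mhz")


def pin_frequency(card_dir, mhz):
    """Ask for a static GPU frequency by writing `mhz` to min, max and boost.

    Assumption: the kernel may reject a write that would leave min > max,
    so max is raised first when going up and min is lowered first otherwise.
    """
    card = Path(card_dir)
    current_max = int((card / "gt_max_freq_mhz").read_text().strip())
    if mhz > current_max:
        order = ("gt_max_freq_mhz", "gt_boost_freq_mhz", "gt_min_freq_mhz")
    else:
        order = ("gt_min_freq_mhz", "gt_max_freq_mhz", "gt_boost_freq_mhz")
    for name in order:
        (card / name).write_text(f"{mhz}\n")


def out_of_range(samples, low, high):
    """Return (index, value) for every sample outside the closed range [low, high]."""
    return [(i, s) for i, s in enumerate(samples) if not low <= s <= high]
```

Calling out_of_range([947, 948, 948, 947, 948, 947, 948, 947, 947, 797], 855, 950) on the values printed above singles out only the last reading. On a real system card_dir would be the /sys/devices/pci0000:00/0000:00:02.0/drm/card0 directory quoted above and the writes need root.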
An idle gpu is always reported as running at the min(idle) freq. You have to be very careful in the workload you construct that it does indeed keep the gpu busy if you want to use a sampling method. No sign of a bug yet...
(In reply to Chris Wilson from comment #11)
> An idle gpu is always reported as running at the min(idle) freq. You have to
> be very careful in the workload you construct that it does indeed keep the
> gpu busy if you want to use a sampling method.
>
Apparently the test case uses queries, not the sampled method of metrics measurement.
> No sign of a bug yet...
No disagreement there; however, any further suggestion as to what to look at to figure out what is going wrong is welcome.
Frequency changes are accompanied by a tracepoint (though note that we only report what we ask the hw to do; the hw is at liberty to do whatever it wants), and you can enable the tracepoints for batch submission, so if you have the time and patience to correlate them with the test you can draw a pretty graph. For something along those lines, see intel-gpu-overlay.
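A rough sketch of the tracepoint approach suggested here, for anyone who wants to try the correlation. Everything named in it is an assumption to be checked against the running kernel: the event names i915/intel_gpu_freq_change and i915/i915_request_add (request tracepoint names differ between kernel versions), the "new_freq=" field in the trace text, and the helper names enable_events and parse_freq_changes, which are hypothetical. As noted above, the tracepoint only reflects what the driver requested, not what the hardware actually did.

```python
import re
from pathlib import Path

# Assumed event names; list events/i915/ under tracefs to see what the kernel offers.
EVENTS = ("i915/intel_gpu_freq_change", "i915/i915_request_add")

# Assumed trace line shape: "... <seconds>: intel_gpu_freq_change: new_freq=<MHz>"
FREQ_RE = re.compile(r"(\d+\.\d+): intel_gpu_freq_change: new_freq=(\d+)")


def enable_events(tracefs_root, events=EVENTS):
    """Write 1 to each event's enable file; return the events that do not exist."""
    root = Path(tracefs_root)
    missing = []
    for event in events:
        enable = root / "events" / event / "enable"
        if enable.exists():
            enable.write_text("1\n")
        else:
            missing.append(event)
    return missing


def parse_freq_changes(trace_text):
    """Extract (timestamp_seconds, requested_mhz) pairs from trace output."""
    return [(float(ts), int(mhz)) for ts, mhz in FREQ_RE.findall(trace_text)]
```

The tracefs root is usually /sys/kernel/tracing or /sys/kernel/debug/tracing, and both enabling events and reading the trace file need root. The parsed pairs can then be lined up against the timestamps of the test's own readings.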
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/784.