Bug 100222 - Hang regression with R7 M370, identified possible culprit commit
Summary: Hang regression with R7 M370, identified possible culprit commit
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
 
Reported: 2017-03-15 23:29 UTC by Mauro Santos
Modified: 2019-11-19 09:26 UTC


Attachments
lspci -nnk and dmesg output (19.95 KB, application/gzip)
2017-03-15 23:29 UTC, Mauro Santos
patch 1/2 (1.17 KB, patch)
2017-03-16 01:15 UTC, Alex Deucher
patch 2/2 (1.30 KB, patch)
2017-03-16 01:15 UTC, Alex Deucher

Description Mauro Santos 2017-03-15 23:29:16 UTC
Created attachment 130246
lspci -nnk and dmesg output

After updating from kernel 4.9 series to 4.10 series I have identified a regression when using the discrete GPU on my laptop (Lenovo Thinkpad E560).

When running any demanding application with DRI_PRIME=1, the card hangs. One example is running 'DRI_PRIME=1 glmark2 -b texture'.

I have noticed that the content of /sys/kernel/debug/dri/1/radeon_pm_info has changed between kernel 4.9 and 4.10 when running glmark2.

With 4.9:
power level 4    sclk: 75000 mclk: 80000 vddc: 1050 vddci: 0 pcie gen: 2

With 4.10:
power level 4    sclk: 87500 mclk: 90000 vddc: 1050 vddci: 0 pcie gen: 2

This led me to revert commit 3a69adfe5617ceba04ad3cff0f9ccad470503fb2. With that commit reverted, the card no longer hangs.

You can find the output of lspci and dmesg in the attachment for the case with commit 3a69adfe5617ceba04ad3cff0f9ccad470503fb2 reverted.
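
For anyone comparing the two kernels, the radeon_pm_info lines quoted above can be parsed mechanically. The following is only a minimal sketch: the struct and function names are made up for illustration, and it assumes the sclk/mclk fields are in units of 10 kHz, which matches the 750MHz and 875MHz figures discussed in this report.

```c
#include <stdio.h>

/* Hypothetical holder for one "power level" line of radeon_pm_info. */
struct pm_level {
    int level;
    unsigned sclk;  /* engine clock, assumed to be in units of 10 kHz */
    unsigned mclk;  /* memory clock, assumed to be in units of 10 kHz */
    unsigned vddc;  /* core voltage field as printed by the driver */
};

/* Parse one line; returns 1 on success, 0 if the layout does not match.
 * Whitespace in the format matches any run of spaces in the input. */
static int parse_pm_level(const char *line, struct pm_level *out)
{
    return sscanf(line, "power level %d sclk: %u mclk: %u vddc: %u",
                  &out->level, &out->sclk, &out->mclk, &out->vddc) == 4;
}

/* Convert a clock value from 10 kHz units to MHz. */
static unsigned clk_to_mhz(unsigned clk_10khz)
{
    return clk_10khz / 100;
}
```

Under that unit assumption, the 4.9 line corresponds to 750 MHz sclk / 800 MHz mclk and the 4.10 line to 875 MHz sclk / 900 MHz mclk.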
Comment 1 Alex Deucher 2017-03-16 01:15:22 UTC
Created attachment 130250
patch 1/2

The attached patches should fix it.
Comment 2 Alex Deucher 2017-03-16 01:15:43 UTC
Created attachment 130251
patch 2/2
Comment 3 Mauro Santos 2017-03-16 12:19:44 UTC
Build fails after applying patch 1 followed by patch 2 with:


drivers/gpu/drm/radeon/si_dpm.c: In function ‘si_get_vce_clock_voltage’:
drivers/gpu/drm/radeon/si_dpm.c:2977:4: error: ‘else’ without a previous ‘if’
  } else if (rdev->family == CHIP_OLAND) {
    ^~~~
drivers/gpu/drm/radeon/si_dpm.c:2985:4: error: ‘max_sclk’ undeclared (first use in this function)
    max_sclk = 75000;
    ^~~~~~~~
drivers/gpu/drm/radeon/si_dpm.c:2985:4: note: each undeclared identifier is reported only once for each function it appears in


The patch changes things inside the si_get_vce_clock_voltage function, but I suppose the changes should be made a few lines below that, in the si_apply_state_adjust_rules function after the quirks for pitcairn and hainan, right?

Another thing that I'm curious about: any guesses as to why the card needs the maximum core clock limited to 750MHz on Linux but seems to work fine on Windows 10 at 875MHz? I've tried it on Windows 10 (all drivers downloaded via Windows Update) with Unigine Heaven + CPU-Z to monitor the frequencies and it seems to go along happily with 875MHz core and 900MHz memory clocks.
Comment 4 Alex Deucher 2017-03-16 13:03:33 UTC
(In reply to Mauro Santos from comment #3)
> Build fails after applying patch 1 followed by patch 2 with:
> 
> 
> drivers/gpu/drm/radeon/si_dpm.c: In function ‘si_get_vce_clock_voltage’:
> drivers/gpu/drm/radeon/si_dpm.c:2977:4: error: ‘else’ without a previous ‘if’
>   } else if (rdev->family == CHIP_OLAND) {
>     ^~~~
> drivers/gpu/drm/radeon/si_dpm.c:2985:4: error: ‘max_sclk’ undeclared (first
> use in this function)
>     max_sclk = 75000;
>     ^~~~~~~~
> drivers/gpu/drm/radeon/si_dpm.c:2985:4: note: each undeclared identifier is
> reported only once for each function it appears in
> 
> 
> The patch changes things inside the si_get_vce_clock_voltage function but I
> suppose the changes should be made a few lines below that to the
> si_apply_state_adjust_rules function after the quirks for pitcairn and
> hainan right?

The patch modifies si_apply_state_adjust_rules; I guess it's not applying cleanly to your kernel.

> 
> Another thing that I'm curious about, any guesses as to why the card needs
> the maximum core clock limited to 750MHz on linux but seems to work fine on
> windows 10 at 875MHz? I've tried it on Windows 10 (all drivers downloaded
> via windows update) with unigine heaven + cpu-z to monitor the frequencies
> and it seems to go along happily with 875MHz core and 900MHz memory clocks.

There is still some bug in the driver that prevents the higher clocks from working stably on your card.  We fixed some issues and the driver was working on the hardware samples we had in house (which is why I removed the workaround), but apparently there are still some variants that are not working correctly.
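
To illustrate the kind of workaround being discussed, here is a rough, self-contained sketch of clamping power-level clocks to a cap. It is not the attached patch and not the driver's code: `struct perf_level` and `clamp_levels` are hypothetical stand-ins, and the only values taken from this report are the 75000 cap visible in the build error and the 87500/90000 clocks from radeon_pm_info.

```c
#include <stddef.h>

/* Hypothetical stand-in for a driver performance level. */
struct perf_level {
    unsigned sclk;  /* engine clock, same units as radeon_pm_info */
    unsigned mclk;  /* memory clock, same units as radeon_pm_info */
};

/* Clamp every level to the given caps. A cap of 0 means "no limit",
 * which is an assumption made for this sketch. */
static void clamp_levels(struct perf_level *lv, size_t n,
                         unsigned max_sclk, unsigned max_mclk)
{
    for (size_t i = 0; i < n; i++) {
        if (max_sclk && lv[i].sclk > max_sclk)
            lv[i].sclk = max_sclk;
        if (max_mclk && lv[i].mclk > max_mclk)
            lv[i].mclk = max_mclk;
    }
}
```

Calling `clamp_levels(levels, n, 75000, 0)` on a table whose top level is 87500/90000 would leave that level at 75000/90000, while lower levels stay untouched.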
Comment 5 Mauro Santos 2017-03-16 17:08:18 UTC
(In reply to Alex Deucher from comment #4)

> The patch modifies si_apply_state_adjust_rules, I guess it's not applying
> cleanly to your kernel.

I've retried it with the current git tree and it does apply properly. Before I was trying with kernel 4.9.2. I can confirm that with the patches that were provided the card does not hang.

I have also tried reverting commit 3a69adfe5617ceba04ad3cff0f9ccad470503fb2 from kernel 4.9.2 (leaving only the sclk limitation) and it also works, no hangs with sclk=750MHz and mclk=900MHz.
Comment 6 Martin Peres 2019-11-19 09:26:29 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/784.