Bug 75127 - runpm hang with PowerXpress/hybrid laptop
Status: NEW
Product: DRI
Classification: Unclassified
Component: DRM/Radeon
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assigned To: Default DRI bug account
Duplicates: 77082

Reported: 2014-02-18 00:50 UTC by Sandeep
Modified: 2014-04-11 21:21 UTC (History)

Attachments
linux_kernel_3.14-rc2_dmesg (90.63 KB, text/plain), 2014-02-18 00:50 UTC, Sandeep
linux_kernel_3.13.3_dmesg (90.63 KB, text/plain), 2014-02-18 02:08 UTC, Sandeep
linux_kernel_3.12.12_dmesg (79.87 KB, text/plain), 2014-03-02 19:51 UTC, Sandeep
linux_kernel_3.13.5_dmesg_dpm_disabled (85.81 KB, text/plain), 2014-03-02 19:53 UTC, Sandeep
possible fix (1.83 KB, patch), 2014-04-07 02:13 UTC, Alex Deucher
possible fix (3.98 KB, patch), 2014-04-09 02:06 UTC, Alex Deucher
possible fix (4.01 KB, patch), 2014-04-09 05:57 UTC, Alex Deucher
dmesg, linux 3.13.7, patched with v3 (65.07 KB, text/plain), 2014-04-10 19:18 UTC, kh3095
possible fix (7.07 KB, patch), 2014-04-10 21:15 UTC, Alex Deucher
dmesg, linux 3.15-git (192.16 KB, text/plain), 2014-04-11 01:42 UTC, Sandeep

Description Sandeep 2014-02-18 00:50:55 UTC
Created attachment 94246 [details]
linux_kernel_3.14-rc2_dmesg

I have a laptop with Radeon HD6520G GPU. 

I am running Arch Linux 64 bit with Linux 3.14-rc2 kernel and Mesa 10.0.3

During shutdown, suspend and resume, GPU hangs and I get error messages in the kernel that state:

[drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[drm:atom_execute_table_locked] *ERROR* atombios stuck executing D05E (len 62, WS 0, PS 0) @ 0xD07A

Earlier, shutdown worked fine and the laptop also suspended quickly.

Now suspend and shutdown take a long time, and the error messages above appear.
Comment 1 Sandeep 2014-02-18 01:23:12 UTC
The dmesg output attached above:

linux_kernel_3.14-rc2_dmesg

is when I suspend the laptop and resume.
Comment 2 Sandeep 2014-02-18 02:07:30 UTC
Sorry, I made a mistake - this is with Linux 3.13 kernel
I will re-upload the same file with the right kernel version number in the name
Comment 3 Sandeep 2014-02-18 02:08:22 UTC
Created attachment 94248 [details]
linux_kernel_3.13.3_dmesg
Comment 4 Alex Deucher 2014-02-18 14:54:39 UTC
Is this a regression?  If so, can you narrow down what component you changed that caused it?  The atombios messages look to be a side effect of a GPU reset.
Comment 5 Sandeep 2014-02-20 18:36:48 UTC
(In reply to comment #4)
> Is this a regression?  If so can you narrow down what component you changed
> that caused it?  The atombios messages look to be a side effect of a GPU
> reset.

I think it's a regression since hangs didn't occur on shutting down, suspend and resume before. The GPU did hang sometimes when switching between VTs and on playing some games fullscreen. I'm not sure when it started hanging for shutdown, suspend and resume but I think it might be after installing Linux 3.13 kernel.

I will try using older kernel versions to see if the problem exists there as well.
Comment 6 Sandeep 2014-02-22 04:04:05 UTC
Compiled and installed Linux 3.12.12 kernel. No GPU hang problems occur for suspend and resume. Works fine (other than the fact that the laptop's display connected through LVDS is blank).

I will post the dmesg output soon.
Comment 7 Alex Deucher 2014-02-22 14:14:51 UTC
(In reply to comment #6)
> Compiled and installed Linux 3.12.12 kernel. No GPU hang problems occur for
> suspend and resume. Works fine (other than the fact that the laptop's
> display connected through LVDS is blank).
>

Does disabling dpm help?  Boot with radeon.dpm=0 on the kernel command line in grub.  If not, can you bisect the kernel with git to find out what commit caused the regression?
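
To confirm that an option such as radeon.dpm=0 actually reached the kernel, the running command line can be inspected. The following is a minimal sketch, assuming a POSIX shell; the helper name cmdline_param is hypothetical and only illustrates the idea:

```shell
#!/bin/sh
# Hypothetical helper: print the value of a parameter (e.g. radeon.dpm)
# found in a kernel command line string, or "unset" if it is absent.
# Usage: cmdline_param "<cmdline>" "<name>"
cmdline_param() {
    value=unset
    # Split the command line on whitespace and look for "<name>=<value>".
    for word in $1; do
        case "$word" in
            "$2"=*) value=${word#*=} ;;
        esac
    done
    printf '%s\n' "$value"
}

# On a live system, check the command line of the running kernel.
if [ -r /proc/cmdline ]; then
    echo "radeon.dpm: $(cmdline_param "$(cat /proc/cmdline)" radeon.dpm)"
fi
```

If the option appears more than once, the sketch reports the last occurrence.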
Comment 8 Sandeep 2014-03-02 19:51:31 UTC
Created attachment 94979 [details]
linux_kernel_3.12.12_dmesg

Suspend, resume and shutdown work fine here
Comment 9 Sandeep 2014-03-02 19:52:59 UTC
Unfortunately disabling dpm did not help.

I set radeon.dpm=0 and booted the Linux 3.13.5 kernel, and the same problems still occurred. Will attach dmesg output shortly.

I will try bisecting.
Comment 10 Sandeep 2014-03-02 19:53:46 UTC
Created attachment 94980 [details]
linux_kernel_3.13.5_dmesg_dpm_disabled
Comment 11 Sandeep 2014-03-06 05:56:31 UTC
I've started using git bisect to find the bad commit(s)

I am bisecting between 3.12 and 3.13 (as tagged)

Results so far:
42a2d923cc349583ebf6fdd52a7d35e1c2f7e6bd - good
Comment 12 Sandeep 2014-03-12 23:05:54 UTC
I'm still bisecting, should be done after a few more revisions.
Comment 13 Sandeep 2014-03-13 03:52:58 UTC
Still have 2-3 more revisions to test.

I suspect it is most likely this commit: 10ebc0bc09344ab6310309169efc73dfe6c23d72
Comment 14 Sandeep 2014-03-13 08:38:07 UTC
Confirmed:

This commit : 10ebc0bc09344ab6310309169efc73dfe6c23d72

is the first bad commit where problems occur.
Comment 15 Alex Deucher 2014-03-13 18:10:51 UTC
Please try these patches:
http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-3.14&id=9babd35ad72af631547c7ca294bc2e931cc40e58
http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-3.14&id=7848865914c6a63ead674f0f5604b77df7d3874f

You can also force runpm off by booting with radeon.runpm=0 on the kernel command line in grub.
Comment 16 Sandeep 2014-03-13 19:17:19 UTC
Setting radeon.runpm=0 helped. Suspend, resume work correctly now.

Which kernel version should I apply the patches to and test with? Latest git commit (3.14-git), or stable 3.13.x kernel code?
Comment 17 Alex Deucher 2014-03-13 19:25:17 UTC
(In reply to comment #16)
> Setting radeon.runpm=0 helped. Suspend, resume work correctly now.
> 
> Which kernel version should I apply the patches to and test with? Latest git
> commit (3.14-git), or stable 3.13.x kernel code?

They are against 3.14, but they should apply to 3.13 as well.
Comment 18 Sandeep 2014-03-15 05:19:02 UTC
Unfortunately, those patches did not help. The GPU hang still occurs (I tested without setting radeon.runpm=0).

I applied the patches against 3.13.6 kernel
Comment 19 Sandeep 2014-04-07 00:15:56 UTC
The GPU reset still occurs on Linux kernel 3.14 as well.
Comment 20 Alex Deucher 2014-04-07 00:59:40 UTC
You have a
Comment 21 Alex Deucher 2014-04-07 00:59:52 UTC
*** Bug 77082 has been marked as a duplicate of this bug. ***
Comment 22 Alex Deucher 2014-04-07 01:01:28 UTC
It seems runpm is not working properly on your system. Booting with radeon.runpm=0 reverts back to the 3.12 behavior (PX dGPUs are not dynamically powered down). Did manually powering on/off the dGPU via debugfs ever work on your system?  See the "Forcing the power state of the devices" section of this page:
http://nouveau.freedesktop.org/wiki/Optimus/
for how to test that.
Comment 23 Sandeep 2014-04-07 01:17:27 UTC
Turning off the dedicated GPU works fine, turning off the GPU doesn't.

The dedicated GPU is a Radeon HD 6650M . The kernel identifies it as a TURKS GPU.
Comment 24 Sandeep 2014-04-07 01:20:14 UTC
Oops, typo in last comment. When I turn off the GPU using:

echo OFF > /sys/kernel/debug/vgaswitcheroo/switch

and then try to turn on the GPU using:

echo ON > /sys/kernel/debug/vgaswitcheroo/switch

GPU reset messages are printed in the kernel. 

(e.g.)
[ 7213.870052] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[ 7213.870055] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing E2F6 (len 2585, WS 4, PS 4) @ 0xE9E0
[ 7213.904826] [drm:radeon_dp_link_train_cr] *ERROR* clock recovery reached max voltage
[ 7213.904827] [drm:radeon_dp_link_train_cr] *ERROR* clock recovery failed
[ 7567.068285] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[ 7567.068289] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing E2F6 (len 2585, WS 4, PS 4) @ 0xE9E0
[ 7567.103047] [drm:radeon_dp_link_train_cr] *ERROR* clock recovery reached max voltage
[ 7567.103048] [drm:radeon_dp_link_train_cr] *ERROR* clock recovery failed
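
The off/on test above could be scripted roughly as follows. This is only a sketch: it assumes root privileges, a mounted debugfs, and the switch path used in the commands above; the function names and the sleep durations are hypothetical choices, not values taken from the driver.

```shell
#!/bin/sh
# Switch file as used in the commands above (needs root and debugfs).
SWITCH=/sys/kernel/debug/vgaswitcheroo/switch

# Count atombios timeout messages in kernel log text read from stdin.
count_atombios_hangs() {
    grep -c 'atombios stuck in loop'
}

# Power the dGPU off and on again, then report how many new timeout
# messages appeared in dmesg. Not called automatically.
power_cycle_dgpu() {
    before=$(dmesg | count_atombios_hangs)
    echo OFF > "$SWITCH"
    sleep 2      # hypothetical settle time
    echo ON > "$SWITCH"
    sleep 10     # hypothetical wait; the error itself mentions a 5 second timeout
    after=$(dmesg | count_atombios_hangs)
    echo "new atombios timeouts: $((after - before))"
}
```

Running power_cycle_dgpu as root would then print the number of new timeouts. The count is only approximate if the kernel ring buffer wraps between the two dmesg reads.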
Comment 25 Alex Deucher 2014-04-07 02:13:49 UTC
Created attachment 97007 [details] [review]
possible fix

Does the attached kernel patch help?
Comment 26 Sandeep 2014-04-09 00:57:07 UTC
Unfortunately, the problem still occurs even with the new patches. I applied them against the latest source code of the kernel from git, after this commit: 18a1a7a1d862ae0794a0179473d08a414dd49234

I still get GPU reset messages even on startup.
Comment 27 Alex Deucher 2014-04-09 02:06:44 UTC
Created attachment 97099 [details] [review]
possible fix

Updated patch.
Comment 28 Sandeep 2014-04-09 05:19:11 UTC
No, unfortunately GPU reset still occurs on startup, suspend, resume and shutdown.

The laptop did suspend faster than earlier cases though, maybe the GPU was able to break out of the reset cycle earlier.
Comment 29 Alex Deucher 2014-04-09 05:57:35 UTC
Created attachment 97106 [details] [review]
possible fix

fix a stupid typo.
Comment 30 kh3095 2014-04-09 23:40:19 UTC
Patch v3 (applied to 3.13.7) doesn't work for me. Again the same messages:

[   20.528628] pciehp 0000:00:03.0:pcie04: Device 0000:02:00.0 already exists at 0000:02:00, cannot hot-add
[   20.528807] pciehp 0000:00:03.0:pcie04: Cannot add device at 0000:02:00
Comment 31 Alex Deucher 2014-04-10 13:09:47 UTC
(In reply to comment #30)
> Patch v3 (applied to 3.13.7) doesn't work for me. Again the same messages:
> 
> [   20.528628] pciehp 0000:00:03.0:pcie04: Device 0000:02:00.0 already exists
> at 0000:02:00, cannot hot-add
> [   20.528807] pciehp 0000:00:03.0:pcie04: Cannot add device at 0000:02:00

Please attach your dmesg output with the patch applied.
Comment 32 kh3095 2014-04-10 19:18:50 UTC
Created attachment 97179 [details]
dmesg, linux 3.13.7, patched with v3

Here you are...
Comment 33 Alex Deucher 2014-04-10 21:15:56 UTC
Created attachment 97193 [details] [review]
possible fix

New patch.
Comment 34 Sandeep 2014-04-10 23:45:36 UTC
Even with the latest patch applied (https://bugs.freedesktop.org/attachment.cgi?id=97193) the problem still occurs.

The system does recover from the reset faster than before though - suspends and resumes in a few seconds now, whereas earlier it would take a few tens of seconds to snap out of the reset cycle.
Comment 35 Alex Deucher 2014-04-11 01:03:04 UTC
(In reply to comment #34)
> Even with the latest patch applied
> (https://bugs.freedesktop.org/attachment.cgi?id=97193) the problem still
> occurs.
> 
> The system does recover from the reset faster than before though - suspends
> and resumes in a few seconds now, whereas earlier it would take a few tens
> of seconds to snap out of the reset cycle.

Please attach your dmesg output with the patch applied.  It shouldn't try to auto-suspend or reset the integrated card at all.  Somehow it seems like runtime pm is still getting applied to the integrated card.
Comment 36 Alex Deucher 2014-04-11 01:07:19 UTC
oh, wait, that's the dGPU that is resetting, not the integrated chip.  Does removing radeon.dpm=1 from your kernel command line in grub help?
Comment 37 Sandeep 2014-04-11 01:10:22 UTC
(In reply to comment #36)
> oh, wait, that's the dGPU that is resetting, not the integrated chip.  Does
> removing radeon.dpm=1 from your kernel command line in grub help?

I will try that now.
Comment 38 Sandeep 2014-04-11 01:30:38 UTC
Results:

Startup(full restart) - no GPU reset
Suspend - GPU reset but recovers quickly
Resume - GPU reset and takes a long time to recover
Comment 39 Alex Deucher 2014-04-11 01:35:39 UTC
Does disabling dpm help (radeon.dpm=0)?  If not, any chance you could bisect?  Also, please attach your dmesg output with the latest patch applied.
Comment 40 Sandeep 2014-04-11 01:42:13 UTC
Created attachment 97203 [details]
dmesg, linux 3.15-git

linux 3.15-git-ce7613db2d + patch v4

radeon module parameters at default settings
Comment 41 Sandeep 2014-04-11 01:57:15 UTC
With radeon.dpm=0 and no other module parameters for radeon

Results:

Startup(full restart) - no GPU reset
Suspend - GPU reset but recovers quickly
Resume - GPU reset and takes a long time to recover
Comment 42 Sandeep 2014-04-11 20:07:57 UTC
What exactly do I need to bisect, i.e. what are the starting and ending commits?
Comment 43 Alex Deucher 2014-04-11 20:17:00 UTC
(In reply to comment #42)
> What exactly do I need to bisect i.e starting and ending commit ?

git bisect start
git bisect good <commit id or tag>
git bisect bad <commit id or tag>

At this point git will check out the commit halfway between these two.  Test it and report back:
git bisect good   # if that commit works
git bisect bad    # if that commit is broken

git will check out the next halfway point.  Repeat until it's done.  Once you've found the problematic commit:

git bisect reset   # resets your tree back to where you were before you started bisecting

E.g., if it was working in 3.12 and broke in 3.13:

git bisect start
git bisect good v3.12
git bisect bad v3.13
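
To see how the halving proceeds without rebuilding a kernel, the same workflow can be tried on a throwaway repository. The sketch below is a toy: the repository, the file names state and counter, and the "works"/"hangs" markers are all made up for illustration, and it uses git bisect run to automate the good/bad answers. A kernel hang like this one still needs a manual boot and test at every step.

```shell
#!/bin/sh
# Toy demonstration of a bisect on a throwaway repository (not the kernel tree).
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q
git config user.email "bisect-demo@example.invalid"
git config user.name "Bisect Demo"

# Create ten commits; from commit 6 onwards the file "state" says "hangs".
i=1
while [ "$i" -le 10 ]; do
    if [ "$i" -ge 6 ]; then echo hangs > state; else echo works > state; fi
    echo "$i" > counter
    git add state counter
    git commit -q -m "commit $i"
    i=$((i + 1))
done

good=$(git rev-list --max-parents=0 HEAD)    # the very first commit
git bisect start HEAD "$good" >/dev/null      # bad revision first, then good
# "grep -q works state" exits 0 (good) while the file says "works", 1 (bad) otherwise.
git bisect run grep -q works state >/dev/null
first_bad=$(git log -1 --format=%s refs/bisect/bad)
git bisect reset >/dev/null 2>&1
echo "first bad commit: $first_bad"    # prints "first bad commit: commit 6"
```

For a driver regression, the search can also be limited to commits touching one directory by passing a path after "--" to git bisect start, for example the radeon driver directory.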
Comment 44 Sandeep 2014-04-11 21:21:49 UTC
Ok, but what should the good and bad commits for the bisect be?

I had already done a bisection earlier and found that the commit adding and enabling runtime power management was where the problems began.