Created attachment 94246 [details] linux_kernel_3.14-rc2_dmesg
I have a laptop with a Radeon HD6520G GPU. I am running 64-bit Arch Linux with the Linux 3.14-rc2 kernel and Mesa 10.0.3.
During shutdown, suspend and resume, the GPU hangs and the kernel log shows these errors:
[drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[drm:atom_execute_table_locked] *ERROR* atombios stuck executing D05E (len 62, WS 0, PS 0) @ 0xD07A
Previously, shutdown worked fine and the laptop also suspended quickly. Now suspend and shutdown take a long time and I see the above error messages.
The dmesg output attached above (linux_kernel_3.14-rc2_dmesg) was captured while suspending and resuming the laptop.
Sorry, I made a mistake: this is with the Linux 3.13 kernel. I will re-upload the same file with the right kernel version number in the name.
Created attachment 94248 [details] linux_kernel_3.13.3_dmesg
Is this a regression? If so, can you narrow down what component you changed that caused it? The atombios messages look to be a side effect of a GPU reset.
(In reply to comment #4)
> Is this a regression? If so, can you narrow down what component you changed
> that caused it? The atombios messages look to be a side effect of a GPU
> reset.
I think it's a regression, since hangs didn't occur on shutdown, suspend and resume before. The GPU did sometimes hang when switching between VTs and when playing some games fullscreen. I'm not sure when it started hanging on shutdown, suspend and resume, but I think it might have been after installing the Linux 3.13 kernel. I will try older kernel versions to see if the problem exists there as well.
Compiled and installed Linux 3.12.12 kernel. No GPU hang problems occur for suspend and resume. Works fine (other than the fact that the laptop's display connected through LVDS is blank). I will post the dmesg output soon.
(In reply to comment #6)
> Compiled and installed Linux 3.12.12 kernel. No GPU hang problems occur for
> suspend and resume. Works fine (other than the fact that the laptop's
> display connected through LVDS is blank).
Does disabling dpm help? Boot with radeon.dpm=0 on the kernel command line in grub. If not, can you bisect the kernel with git to find out what commit caused the regression?
Created attachment 94979 [details] linux_kernel_3.12.12_dmesg Suspend, resume and shutdown work fine here
Unfortunately disabling dpm did not help. I set radeon.dpm=0 and booted the Linux 3.13.5 kernel, and the same problems still occurred. Will attach dmesg output shortly. I will try bisecting.
Created attachment 94980 [details] linux_kernel_3.13.5_dmesg_dpm_disabled
I've started using git bisect to find the bad commit(s). I am bisecting between 3.12 and 3.13 (as tagged).
Results so far:
42a2d923cc349583ebf6fdd52a7d35e1c2f7e6bd - good
I'm still bisecting, should be done after a few more revisions.
Still have 2-3 more revisions to test. I suspect it is most likely this commit: 10ebc0bc09344ab6310309169efc73dfe6c23d72
Confirmed: commit 10ebc0bc09344ab6310309169efc73dfe6c23d72 is the first bad commit, where the problems start to occur.
Please try these patches:
http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-3.14&id=9babd35ad72af631547c7ca294bc2e931cc40e58
http://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-fixes-3.14&id=7848865914c6a63ead674f0f5604b77df7d3874f
You can also force runpm off by booting with radeon.runpm=0 on the kernel command line in grub.
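After rebooting with a radeon option such as radeon.runpm=0 or radeon.dpm=0, it can help to confirm that the option actually reached the kernel before judging the result. The following is only a minimal sketch: the function name radeon_cmdline_param is hypothetical (not part of any tool), and it assumes the command line is passed in as a plain space-separated string such as the contents of /proc/cmdline.

```shell
# Hypothetical helper: print the value of radeon.<param> found in a kernel
# command line string. If the parameter appears more than once, the last
# occurrence is printed. Returns 1 when the parameter is absent.
#   $1: parameter name without the "radeon." prefix (e.g. runpm, dpm)
#   $2: kernel command line as a single string
radeon_cmdline_param() {
    param="$1"
    cmdline="$2"
    value=""
    found=1
    # Intentional word splitting: the command line is space separated.
    for word in $cmdline; do
        case "$word" in
            "radeon.$param="*)
                value="${word#*=}"
                found=0
                ;;
        esac
    done
    [ "$found" -eq 0 ] && printf '%s\n' "$value"
    return "$found"
}
```

On a running system this would be called as: radeon_cmdline_param runpm "$(cat /proc/cmdline)"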
Setting radeon.runpm=0 helped. Suspend, resume work correctly now. Which kernel version should I apply the patches to and test with? Latest git commit (3.14-git), or stable 3.13.x kernel code?
(In reply to comment #16)
> Setting radeon.runpm=0 helped. Suspend, resume work correctly now.
>
> Which kernel version should I apply the patches to and test with? Latest git
> commit (3.14-git), or stable 3.13.x kernel code?
They are against 3.14, but they should apply to 3.13 as well.
Unfortunately, those patches did not help. The GPU hang still occurs (I tested without setting radeon.runpm=0). I applied the patches against the 3.13.6 kernel.
The GPU reset still occurs on Linux kernel 3.14 as well.
You have a
*** Bug 77082 has been marked as a duplicate of this bug. ***
It seems runpm is not working properly on your system. Booting with radeon.runpm=0 reverts back to the 3.12 behavior (PX dGPUs are not dynamically powered down). Did manually powering on/off the dGPU via debugfs ever work on your system? See the "Forcing the power state of the devices" section of this page: http://nouveau.freedesktop.org/wiki/Optimus/ for how to test that.
Turning off the dedicated GPU works fine, turning off the GPU doesn't. The dedicated GPU is a Radeon HD 6650M . The kernel identifies it as a TURKS GPU.
Oops, typo in last comment. When I turn off the GPU using:
echo OFF > /sys/kernel/debug/vgaswitcheroo/switch
and then try to turn on the GPU using:
echo ON > /sys/kernel/debug/vgaswitcheroo/switch
GPU reset messages are printed in the kernel log, e.g.:
[ 7213.870052] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[ 7213.870055] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing E2F6 (len 2585, WS 4, PS 4) @ 0xE9E0
[ 7213.904826] [drm:radeon_dp_link_train_cr] *ERROR* clock recovery reached max voltage
[ 7213.904827] [drm:radeon_dp_link_train_cr] *ERROR* clock recovery failed
[ 7567.068285] [drm:atom_op_jump] *ERROR* atombios stuck in loop for more than 5secs aborting
[ 7567.068289] [drm:atom_execute_table_locked] *ERROR* atombios stuck executing E2F6 (len 2585, WS 4, PS 4) @ 0xE9E0
[ 7567.103047] [drm:radeon_dp_link_train_cr] *ERROR* clock recovery reached max voltage
[ 7567.103048] [drm:radeon_dp_link_train_cr] *ERROR* clock recovery failed
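The OFF/ON test described here can be repeated in a scripted way, which makes it easier to compare kernels or patches. This is a rough sketch only: the function names dgpu_power_cycle and count_atombios_loops are hypothetical, the 2 second default delay is an arbitrary assumption, and the default path is the debugfs file used in this report (writing to it needs root and a mounted debugfs).

```shell
# Hypothetical helper: power-cycle the dGPU through vgaswitcheroo.
#   $1: switch file (default: the debugfs path used in this report)
#   $2: seconds to wait between OFF and ON (default 2, arbitrary)
dgpu_power_cycle() {
    switch_file="${1:-/sys/kernel/debug/vgaswitcheroo/switch}"
    delay="${2:-2}"
    echo OFF > "$switch_file" || return 1
    sleep "$delay"
    echo ON > "$switch_file" || return 1
}

# Hypothetical helper: count "atombios stuck in loop" errors in kernel log
# text read from stdin. Prints the number of matching lines.
count_atombios_loops() {
    grep -c 'atombios stuck in loop'
}
```

As root, one possible use is to run dgpu_power_cycle and then pipe dmesg into count_atombios_loops, comparing the count before and after the power cycle.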
Created attachment 97007 [details] [review] possible fix Does the attached kernel patch help?
Unfortunately, the problem still occurs even with the new patches. I applied them against the latest source code of the kernel from git, after this commit: 18a1a7a1d862ae0794a0179473d08a414dd49234 I still get GPU reset messages even on startup.
Created attachment 97099 [details] [review] possible fix Updated patch.
No, unfortunately GPU reset still occurs on startup, suspend, resume and shutdown. The laptop did suspend faster than earlier cases though, maybe the GPU was able to break out of the reset cycle earlier.
Created attachment 97106 [details] [review] possible fix fix a stupid typo.
Patch v3 (applied to 3.13.7) doesn't work for me. Again the same messages:
[ 20.528628] pciehp 0000:00:03.0:pcie04: Device 0000:02:00.0 already exists at 0000:02:00, cannot hot-add
[ 20.528807] pciehp 0000:00:03.0:pcie04: Cannot add device at 0000:02:00
(In reply to comment #30)
> Patch v3 (applied to 3.13.7) doesn't work for me. Again the same messages:
>
> [ 20.528628] pciehp 0000:00:03.0:pcie04: Device 0000:02:00.0 already exists at 0000:02:00, cannot hot-add
> [ 20.528807] pciehp 0000:00:03.0:pcie04: Cannot add device at 0000:02:00
Please attach your dmesg output with the patch applied.
Created attachment 97179 [details] dmesg, linux 3.13.7, patched with v3 Here you are...
Created attachment 97193 [details] [review] possible fix New patch.
Even with the latest patch applied (https://bugs.freedesktop.org/attachment.cgi?id=97193) the problem still occurs. The system does recover from the reset faster than before though - suspends and resumes in a few seconds now, whereas earlier it would take a few tens of seconds to snap out of the reset cycle.
(In reply to comment #34)
> Even with the latest patch applied
> (https://bugs.freedesktop.org/attachment.cgi?id=97193) the problem still
> occurs.
>
> The system does recover from the reset faster than before though - suspends
> and resumes in a few seconds now, whereas earlier it would take a few tens
> of seconds to snap out of the reset cycle.
Please attach your dmesg output with the patch applied. It shouldn't try to auto-suspend or reset the integrated card at all. Somehow it seems like runtime pm is still getting applied to the integrated card.
Oh, wait, that's the dGPU that is resetting, not the integrated chip. Does removing radeon.dpm=1 from your kernel command line in grub help?
(In reply to comment #36)
> Oh, wait, that's the dGPU that is resetting, not the integrated chip. Does
> removing radeon.dpm=1 from your kernel command line in grub help?
I will try that now.
Results:
Startup (full restart) - no GPU reset
Suspend - GPU reset but recovers quickly
Resume - GPU reset and takes a long time to recover
Does disabling dpm help (radeon.dpm=0)? If not, any chance you could bisect? Also, please attach your dmesg output with the latest patch applied.
Created attachment 97203 [details] dmesg, linux 3.15-git
linux 3.15-git-ce7613db2d + patch v4
radeon module parameters at default settings
With radeon.dpm=0 and no other module parameters for radeon, the results are:
Startup (full restart) - no GPU reset
Suspend - GPU reset but recovers quickly
Resume - GPU reset and takes a long time to recover
What exactly do I need to bisect, i.e. what are the starting and ending commits?
(In reply to comment #42)
> What exactly do I need to bisect i.e starting and ending commit ?
git bisect start
git bisect good <commit id or tag>
git bisect bad <commit id or tag>
At this point git will check out the commit halfway between these two. Test it and report back:
git bisect good   # if that commit works
git bisect bad    # if that commit is broken
git will then check out the next halfway point. Repeat until it's done. Once you've found the problematic commit:
git bisect reset  # resets your tree back to where you were before you started bisecting
E.g., if it was working in 3.12 and broke in 3.13:
git bisect start
git bisect good v3.12
git bisect bad v3.13
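Each bisect step here needs a reboot into the test kernel, so the good/bad decision has to be made by hand. One way to keep that decision consistent is to save the dmesg output after a suspend/resume cycle and derive the verdict from it. This is only a sketch: the function name bisect_verdict is hypothetical, and it assumes the "atombios stuck in loop" message seen in this report is a reliable marker for a bad kernel.

```shell
# Hypothetical helper: print the git bisect command that matches a saved
# kernel log.
#   $1: file holding dmesg output captured after a suspend/resume cycle
# Assumption: any "atombios stuck in loop" line marks the kernel as bad.
bisect_verdict() {
    if grep -q 'atombios stuck in loop' "$1"; then
        echo "git bisect bad"
    else
        echo "git bisect good"
    fi
}
```

For example, after resuming: save the log with dmesg > resume.log, then run bisect_verdict resume.log and issue the printed command in the kernel tree.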
Ok, but what should the good and bad commits for the bisect be? I had already done a bisection earlier and found that the commit adding and enabling runtime power management was where the problems began.
Created attachment 117623 [details] Partial kernel logs with full atom debug info
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/446.