Bug 98832

Summary: Distorted colours after suspend / resume cycle
Product: xorg
Reporter: Jan-Marek Glogowski <glogow>
Component: Driver/Radeon
Assignee: xf86-video-ati maintainers <xorg-driver-ati>
Status: NEW
QA Contact: Xorg Project Team <xorg-team>
Severity: normal
Priority: medium
CC: felix.schwarz
Version: 7.7 (2012.06)
Hardware: x86-64 (AMD64)
OS: Linux (All)
URL: https://bugs.launchpad.net/bugs/1643843
Attachments:
 Broken unity login screen after resume (flags: none)
 Ubuntu Linux kernel diff of 3.13 (flags: none)

Description Jan-Marek Glogowski 2016-11-23 16:42:51 UTC
After waking the system from suspend, the display usually shows distorted colours.
As a workaround, one can switch to the Linux console (VT) and back to fix the display.
Alternatively, triggering DPMS also fixes the problem (xset dpms force off ; sleep 0.1; xset dpms force on).
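
A minimal sketch of the DPMS workaround above as a reusable shell function. The function name dpms_cycle and its optional delay argument are illustrative assumptions; only the xset invocations come from this report.

```shell
# Sketch only: cycle DPMS off and on so the display gets re-programmed.
# dpms_cycle is a hypothetical helper name; the delay defaults to the
# 0.1 seconds used in the report (fractional sleep needs e.g. GNU sleep).
dpms_cycle() {
    delay="${1:-0.1}"
    xset dpms force off || return 1
    sleep "$delay"
    xset dpms force on
}
```

Calling dpms_cycle from a terminal inside the affected X session should behave like the one-liner quoted above.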

Hardware:
 01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Caicos [Radeon HD 6450/7450/8450 / R5 230 OEM]
 PCI ID: 1002:6779

I've tested Ubuntu Xenial (16.04) with the latest updates and Ubuntu Trusty (14.04) with various HWE stacks. Full history of the triage is in the Ubuntu bug.

In the end I bisected between the broken v4.1.3 and the fixed v4.1.4 kernel, because bisecting between v4.1 and v4.2 didn't turn up a fixed kernel. I should probably run another bisect between v4.0 and v4.1 to find the commit that actually broke it.

$ git bisect log
# bad: [89e419960fb6a260f6a112821507d516117d5aa1] Linux 4.1.4
# good: [c8bde72f9af412de57f0ceae218d648640118b0b] Linux 4.1.3
git bisect start 'v4.1.4' 'v4.1.3'
# good: [e0cf83cc3de0341d8cabbb23097ad85f5ce97a11] drm/qxl: Do not leak memory if qxl_release_list_add fails
git bisect good e0cf83cc3de0341d8cabbb23097ad85f5ce97a11
# bad: [9d680e03989324f000596be4862728a7c30f22c1] selinux: don't waste ebitmap space when importing NetLabel categories
git bisect bad 9d680e03989324f000596be4862728a7c30f22c1
# bad: [510c99974fdbc18f143c41cbd461c522f5ad7164] tpm, tpm_crb: fix le64_to_cpu conversions in crb_acpi_add()
git bisect bad 510c99974fdbc18f143c41cbd461c522f5ad7164
# bad: [8b941a43ea7709111d3cbea2bdfcc678975255da] drm/radeon: only check the sink type on DP connectors
git bisect bad 8b941a43ea7709111d3cbea2bdfcc678975255da
# good: [0f2bb042f21bdb28f20efcf1ff1c507e2f8b3caa] drm/i915: Declare the swizzling unknown for L-shaped configurations
git bisect good 0f2bb042f21bdb28f20efcf1ff1c507e2f8b3caa
# good: [1f977d7e942519127aea0a08e9d08437b363cf19] drm/i915: Use two 32bit reads for select 64bit REG_READ ioctls
git bisect good 1f977d7e942519127aea0a08e9d08437b363cf19
# good: [7b49262b642511a16699cc63cf2a716739f0c43f] drm/radeon: SDMA fix hibernation (CI GPU family).
git bisect good 7b49262b642511a16699cc63cf2a716739f0c43f
# bad: [d1a4362d41e4feb52df6464f70fb64f21b894623] Revert "drm/radeon: dont switch vt on suspend"
git bisect bad d1a4362d41e4feb52df6464f70fb64f21b894623
# first bad commit: [d1a4362d41e4feb52df6464f70fb64f21b894623] Revert "drm/radeon: dont switch vt on suspend"

Just remember that good and bad are inverted, as I was looking for the commit which fixes my colour problem in v4.1.4. For me it fixes more than just the cursor, which I also saw breaking.
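
When hunting for a fixing commit, git's custom bisect terms (available since git 2.7) can avoid the mental inversion of good and bad. A sketch, assuming the current directory is a kernel git checkout; the helper name start_fix_bisect and the term names broken/fixed are illustrative, not part of the original bisect:

```shell
# Sketch only: start a bisect that looks for the commit which FIXES a bug.
# "broken" replaces git's "old/good" term, "fixed" replaces "new/bad".
start_fix_bisect() {
    fixed_rev="$1"    # newer revision where the bug is gone, e.g. v4.1.4
    broken_rev="$2"   # older revision that still shows the bug, e.g. v4.1.3
    git bisect start --term-old broken --term-new fixed &&
    git bisect fixed "$fixed_rev" &&
    git bisect broken "$broken_rev"
}
```

After start_fix_bisect v4.1.4 v4.1.3, each tested kernel would be marked with git bisect fixed or git bisect broken, and the result is reported as the first "fixed" commit.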
Comment 1 Jan-Marek Glogowski 2016-11-23 16:48:22 UTC
Created attachment 128167
Broken unity login screen after resume
Comment 2 Jan-Marek Glogowski 2016-11-24 09:38:20 UTC
I can't bisect the problem between v4.0 and v4.1, because for the two kernels I built in this range, suspend was broken and the PC didn't wake up from the second suspend, so I don't have a way to reproduce the bug :-(

I'm thinking about building the radeon driver only as an external module. And I just saw it's possible to include a path in the bisect, which might help to bisect just changes in drivers/gpu/drm/radeon/…

I also tried 4.9-rc6 (albeit on Arch Linux), which shows the same bug.
Comment 3 Michel Dänzer 2016-11-24 09:57:31 UTC
Maybe we need to add a radeon_crtc_load_lut call somewhere in the resume path.
Comment 4 Jan-Marek Glogowski 2016-11-24 11:21:43 UTC
I just finished my directory-based bisect, which came to the same conclusion:

$ git bisect log
# bad: [b953c0d234bc72e8489d3bf51a276c5c4ec85345] Linux 4.1
# good: [39a8804455fb23f09157341d3ba7db6d7ae6ee76] Linux 4.0
git bisect start 'v4.1' 'v4.0' 'drivers/gpu/drm/radeon/'
# bad: [9e87e48f8e5de2146842fd0ff436e0256b52c4a9] Merge tag 'drm-intel-next-2015-03-27-merge' of git://anongit.freedesktop.org/drm-intel into drm-next
git bisect bad 9e87e48f8e5de2146842fd0ff436e0256b52c4a9
# bad: [c6d2ac2c36f80b8be15d47a8da6fca803a432e1c] drm/radeon: add get_allowed_info_register for r6xx/r7xx
git bisect bad c6d2ac2c36f80b8be15d47a8da6fca803a432e1c
# bad: [296deb7167b960d935025de770f3e3c6c2998fbd] drm/radeon/rv7xx/eg: implement get_current_sclk/mclk
git bisect bad 296deb7167b960d935025de770f3e3c6c2998fbd
# good: [a1dcc2778b682361351a369652b66dd2d66cf1d9] drm/radeon: setup quantization_range in AVI infoframe
git bisect good a1dcc2778b682361351a369652b66dd2d66cf1d9
# bad: [d7dbce09b61dbd8c00ea401a2dc734193309cb91] drm/radeon/dpm: add new callbacks to get the current sclk/mclk
git bisect bad d7dbce09b61dbd8c00ea401a2dc734193309cb91
# bad: [d6d2a1882a79c1a5425d6f82b2fc7b934916f893] drm/radeon: add INFO query for GPU temperature
git bisect bad d6d2a1882a79c1a5425d6f82b2fc7b934916f893
# bad: [b9729b17a414f99c61f4db9ac9f9ed987fa0cbfe] drm/radeon: dont switch vt on suspend
git bisect bad b9729b17a414f99c61f4db9ac9f9ed987fa0cbfe
# first bad commit: [b9729b17a414f99c61f4db9ac9f9ed987fa0cbfe] drm/radeon: dont switch vt on suspend

What else can I provide to help fix the real bug?
I would like to avoid rolling out the revert as a fix.

I can try putting a radeon_crtc_load_lut call somewhere in the evergreen_resume code path, but that will take quite a while without any further advice.
Comment 5 Jan-Marek Glogowski 2016-11-25 18:09:11 UTC
Created attachment 128190
Ubuntu Linux kernel diff of 3.13

I thought this bug was the actual problem I'm seeing, or at least had the same origin, but it isn't. Using Linux 3.13, I also see this bug when returning after a longer idle time. No suspend is involved.

And a "xset dpms force off" call still fixes the palette problem.

Now I had a look at the HWE kernel of our previous Precise (12.04) based release, where we didn't see this problem with the same HW. The attached diff includes all radeon changes between our working version and the current kernel (~2k lines).

The most prominent change is probably the default switch to DPM for CHIP_CAICOS, but that's really a long shot (profile => dpm in /sys/class/drm/card0/device/power_method).
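
To check which power management method a machine actually uses, the sysfs file named above can be read directly. A small sketch; the helper name radeon_power_method and the "unknown" fallback are illustrative assumptions, and card0 is simply the default path quoted in this comment:

```shell
# Sketch only: print the radeon power method (e.g. "profile" or "dpm").
# The card directory is an optional argument so that other cards
# (card1, ...) can be inspected as well.
radeon_power_method() {
    card="${1:-/sys/class/drm/card0}"
    file="$card/device/power_method"
    if [ -r "$file" ]; then
        cat "$file"
    else
        # File missing or unreadable, e.g. not a radeon device.
        echo "unknown"
    fi
}
```

Running radeon_power_method on a working and on an affected machine would show whether the two differ in their power method.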
Comment 6 Jan-Marek Glogowski 2016-11-28 17:37:35 UTC
While trying to reproduce the original bug, which involves long idle / wait times, I'm now quite sure that radeon.dpm=0 prevents the distorted colours.
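
When comparing test machines, it helps to verify that the radeon.dpm parameter really was passed to the running kernel. A sketch that parses a kernel command line; the helper name radeon_dpm_param and the "unset" fallback are illustrative assumptions:

```shell
# Sketch only: print the value of radeon.dpm from a kernel command line,
# or "unset" if the parameter is absent. Without an argument the command
# line of the running kernel is read from /proc/cmdline.
radeon_dpm_param() {
    cmdline="${1:-$(cat /proc/cmdline)}"
    for word in $cmdline; do
        case "$word" in
            radeon.dpm=*)
                echo "${word#radeon.dpm=}"
                return 0
                ;;
        esac
    done
    echo "unset"
}
```

On a machine booted with radeon.dpm=0, radeon_dpm_param would print 0.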

I don't know whether a code path exists that is shared between DPMS, DPM and resume.
Comment 7 Felix Schwarz 2016-12-22 21:50:17 UTC
I saw the same bug on my Radeon HD6450 and I used to revert the mentioned commit locally for quite some time (which fixed the bug). However, I stopped doing that when I noticed I could work around the issue simply by switching the VT manually (Ctrl+Alt+F?). Also, I have not noticed the bug since I switched to Wayland (F25).

Btw: Bug 99163 is about HDMI audio but is supposedly also fixed by reverting this commit.
Comment 8 Jan-Marek Glogowski 2016-12-22 22:47:03 UTC
Seems I actually forgot to post my last comment…

So I added various versions of:

printk(KERN_ALERT "JMG - %s:%d: ...\n", __FUNCTION__, __LINE__, ....);

And it seems the problem occurs less often now, which adds to my suspicion of a race condition, as DPMS wake-up doesn't always show broken colours.

I couldn't really identify a place with a missing radeon_crtc_load_lut call:

[Mo Dez  5 16:10:43 2016] JMG - dce5_crtc_load_lut:108: 0
[Mo Dez  5 16:12:56 2016] JMG - atombios_crtc_dpms:294: 3 (off)
[Mo Dez  5 16:12:56 2016] JMG - atombios_crtc_dpms:294: 1 (standby)
[Mo Dez  5 16:13:56 2016] JMG - atombios_crtc_dpms:294: 2 (suspend)
[Mo Dez  5 16:14:56 2016] JMG - atombios_crtc_dpms:294: 3 (off)
[Mo Dez  5 16:45:52 2016] JMG - atombios_crtc_dpms:294: 0 (on)
[Mo Dez  5 16:45:52 2016] JMG - radeon_crtc_load_lut:196: 1
[Mo Dez  5 16:45:52 2016] JMG - dce5_crtc_load_lut:108: 0

dce5_crtc_load_lut was definitely called here, but the colours were wrong nevertheless.
I'm not sure why "atombios_crtc_dpms:294: 3" is logged twice; possibly the second call is for the inactive port?

I couldn't reproduce the problem by just waiting for 5 minutes, nor with radeon.dpm=0 set.

So a few days ago we rolled out a kernel configuration with DPM disabled to my few test machines. The feedback so far is positive, so we're planning to roll it out to our few thousand machines next year.

I don't care that much about suspend / resume, but the broken colours after DPMS (so no suspend involved) are highly irritating for the users.
Comment 9 Alberto 2017-11-14 22:09:06 UTC
I'm experiencing this bug with an R5 230 (basically a 6450) and Linux 4.11.1.
I can confirm that reverting b9729b17a414f99c61f4db9ac9f9ed987fa0cbfe solves the issue for me. Since it's a one-liner and seems completely harmless, could we revert this upstream?
