Bug 101387

Summary: amdgpu display corruption and hang on AMD A10-9620P
Product: DRI Reporter: Carlo Caione <carlo>
Component: DRM/AMDgpu    Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED QA Contact:
Severity: major    
Priority: medium    
Version: XOrg git   
Hardware: Other   
OS: All   
Whiteboard:
Attachments:
  Trace/freeze on amdgpu modprobing (flags: none)
  log no stack protection (flags: none)
  possible fix (flags: none)

Description Carlo Caione 2017-06-12 10:21:41 UTC
Created attachment 131886 [details]
Trace/freeze on amdgpu modprobing

As reported by Daniel Drake in [0] 

We are working with new laptops based on an AMD SoC:

AMD A10-9620P RADEON R5, 10 COMPUTE CORES 4C+6G

I believe this is the Bristol Ridge chipset.

During boot, the display becomes unusable at the point where the
amdgpu driver loads. You can see at least two horizontal lines of
garbage at this point. We have reproduced this on 4.8, 4.10 and Linus'
master (early 4.12).

Photo: http://pasteboard.co/qrC9mh4p.jpg

Attached is the log taken via netconsole while modprobing amdgpu.


[0] http://www.spinics.net/lists/dri-devel/msg140915.html
Comment 1 Carlo Caione 2017-06-12 11:13:57 UTC
Created attachment 131890 [details]
log no stack protection

Just an update on this. If I recompile the kernel with CONFIG_CC_STACKPROTECTOR_NONE (i.e. with the stack protector disabled), I no longer get the panic (of course), but I can still see the corruption at boot (and weird artifacts in plymouth). The log is in the attachment.
Comment 2 Michel Dänzer 2017-06-13 02:33:38 UTC
Looks like amdgpu_atombios_crtc_powergate_init scribbles over the stack.

(In reply to Carlo Caione from comment #1)
> If I recompile the kernel with CONFIG_CC_STACKPROTECTOR_NONE (i.e. with the stack
> protector disabled), I no longer get the panic (of course), but I can still see the
> corruption at boot (and weird artifacts in plymouth).

What kind of artifacts in plymouth? Does e.g. X work correctly?
Comment 3 Carlo Caione 2017-06-13 07:49:53 UTC
> Looks like amdgpu_atombios_crtc_powergate_init scribbles over the stack.
Yeah, and while that is what's causing the panic, it's not related to the display corruption.

The garbage lines we see at boot seem to appear at the same time as these messages in the log:

[drm] amdgpu atom DIG backlight initialized
[drm] AMDGPU Display Connectors
[drm] Connector 0:
[drm]   eDP-1
[drm]   HPD1
[drm]   DDC: 0x4868 0x4868 0x4869 0x4869 0x486a 0x486a 0x486b 0x486b
[drm]   Encoders:
[drm]     LCD1: INTERNAL_UNIPHY
[drm] Connector 1:
[drm]   HDMI-A-1
[drm]   HPD2
[drm]   DDC: 0x486c 0x486c 0x486d 0x486d 0x486e 0x486e 0x486f 0x486f
[drm]   Encoders:
[drm]     DFP1: INTERNAL_UNIPHY
[drm] amdgpu atom DIG backlight initialized
[drm] AMDGPU Display Connectors
[drm] Connector 0:
[drm]   eDP-1
[drm]   HPD1
[drm]   DDC: 0x4868 0x4868 0x4869 0x4869 0x486a 0x486a 0x486b 0x486b
[drm]   Encoders:
[drm]     LCD1: INTERNAL_UNIPHY
[drm] Connector 1:
[drm]   HDMI-A-1
[drm]   HPD2
[drm]   DDC: 0x486c 0x486c 0x486d 0x486d 0x486e 0x486e 0x486f 0x486f
[drm]   Encoders:
[drm]     DFP1: INTERNAL_UNIPHY


> What kind of artifacts in plymouth?
https://pasteboard.co/hWu18yPj0.jpg

> Does e.g. X work correctly?
Yes, with stack protection disabled Xorg works fine (no panic, but I still have the display corruption at boot and on poweroff/reboot).
Comment 4 Michel Dänzer 2017-06-13 08:44:32 UTC
(In reply to Carlo Caione from comment #3)
> > What kind of artifacts in plymouth?
> https://pasteboard.co/hWu18yPj0.jpg

That could be a plymouth bug. I've noticed that it doesn't seem to call drmModeCrtcSetGamma to make sure the display CLUTs are initialized to appropriate values.
Comment 5 Carlo Caione 2017-06-13 08:49:08 UTC
> That could be a plymouth bug. I've noticed that it doesn't seem to call
> drmModeCrtcSetGamma to make sure the display CLUTs are initialized to
> appropriate values.

Yeah, it could be. I'll investigate the plymouth issue later; right now the kernel panic is the top priority on my list :D

Any idea what's triggering the stack corruption? I'm still investigating, but right now my bet is on the cast of args to (uint32_t *) in:

amdgpu_atom_execute_table(adev->mode_info.atom_context, index, (uint32_t *)&args);
Comment 6 Carlo Caione 2017-06-13 11:35:43 UTC
Hmm, I think I have found something.

In amdgpu_atombios_crtc_powergate_init() we are declaring

ENABLE_DISP_POWER_GATING_PARAMETERS_V2_1 args;

so args is basically a 32-bit (4-byte) struct. We are passing this struct down to amdgpu_atom_execute_table(), casting it to (uint32_t *). That address is then assigned to ectx.ps (a uint32_t *) in amdgpu_atom_execute_table_locked().

At a certain point during the execution of the code in the table with index == 75, atom_put_dst() is called with ATOM_ARG_PS and idx == 1. So we end up doing:

ctx->ps[idx] = cpu_to_le32(val);

but since idx == 1, we are writing past the end of args, which triggers the stack corruption.
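
To make this concrete, here is a small self-contained model of the overflow (illustrative only, not kernel code; the struct and names are made up, only the 4-byte size and the ps[1] write mirror the real case):

#include <stdint.h>
#include <stdio.h>

struct power_gating_params {    /* stands in for ENABLE_DISP_POWER_GATING_PARAMETERS_V2_1 */
        uint8_t disp_pipe_id;
        uint8_t enable;
        uint8_t padding[2];
};                              /* sizeof == 4 */

static void atom_put_ps(uint32_t *ps, int idx, uint32_t val)
{
        ps[idx] = val;          /* idx == 1 writes bytes 4..7, past the end of the struct */
}

int main(void)
{
        struct {
                struct power_gating_params args;
                uint32_t canary;        /* models whatever happens to sit next on the stack */
        } frame = { .canary = 0xdeadbeef };

        atom_put_ps((uint32_t *)&frame.args, 1, 0);
        printf("canary after ps[1] write: 0x%08x\n", (unsigned)frame.canary); /* no longer 0xdeadbeef */
        return 0;
}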

Is this analysis correct, and if so, how can we fix it?
Comment 7 Carlo Caione 2017-06-13 13:13:10 UTC
Just to be clear: I have already verified that &args and ctx->ps hold the same address, so this is definitely something we want to fix.
Comment 8 Carlo Caione 2017-06-14 11:59:41 UTC
Here is a better description of what's going on, plus a couple of questions.

When amdgpu_atombios_crtc_powergate_init() is called, it triggers the parsing of the command table with index == 13 (>> execute C5C0 (len 589, WS 0, PS 0)). As already reported, the parameter space used (struct ENABLE_DISP_POWER_GATING_PARAMETERS_V2_1) is 4 bytes (32 bits) wide.

During the execution of this table, several CALL_TABLE opcodes (op == 82) are executed.

In particular, we first jump to the table with index == 78 (>> execute F166 (len 588, WS 0, PS 8)), then to the table with index == 51 (>> execute F446 (len 465, WS 4, PS 4)), and finally to the table with index == 75 (>> execute F6CC (len 1330, WS 4, PS 0)) before reaching the EOT for table 13.

During the execution of table 75, a MOVE_PS is executed with destination index == 1, accessing ctx->ps[1] and causing the stack corruption.

So either the atombios code is wrong or the atombios interpreter in the kernel is doing something wrong.

I also have a couple of questions / observations:

1) Table 75 has WS == 4 and PS == 0, and looking at the opcodes in the table I see almost exclusively *_WS opcodes (MOVE_WS, TEST_WS, ADD_WS, etc.) and just two *_PS instructions (MOVE_PS and OR_PS), which (guess what) are the instructions causing the stack corruption. My guess is that these *_PS opcodes in the atombios are wrong and should actually be *_WS opcodes.

2) After a CALL_TABLE, don't we need to allocate a parameter space for the callee table whose size matches the PS size in that table's header? If I understand the kernel code correctly, ctx->ps is not reallocated when we jump to a different table (a toy model of this is sketched after these questions).

3) Could point (2) also be a problem in our case? Assuming that the PS value read from the table header relates to the size of the parameter space (guessing here), table 13 has PS == 0 while table 75 has PS == 4, yet both are using the same ctx->ps.
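
A toy model of point (2), based purely on the trace above (this is an assumption about the interpreter's behaviour, not verbatim atom.c):

#include <stdint.h>

struct exec_ctx {
        uint32_t *ps;           /* parameter space seen by this table */
        int ps_shift;           /* this table's PS size, in dwords */
};

/* On CALL_TABLE the callee appears to get no PS of its own: it reuses the
 * caller's buffer, advanced by the caller's PS size. Table 13 has PS == 0,
 * so tables 78, 51 and 75 would all operate directly on the original 4-byte
 * args, and table 75's MOVE_PS to index 1 overruns it. */
void model_call_table(const struct exec_ctx *caller, struct exec_ctx *callee,
                      int callee_ps_dwords)
{
        callee->ps = caller->ps + caller->ps_shift;     /* no reallocation */
        callee->ps_shift = callee_ps_dwords;
}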
Comment 9 Alex Deucher 2017-06-15 14:44:06 UTC
Can you please attach a copy of your vbios?  

(as root)
(use lspci to get the bus id)
cd /sys/bus/pci/devices/<pci bus id>
echo 1 > rom
cat rom > /tmp/vbios.rom
echo 0 > rom
Comment 10 Alex Deucher 2017-06-15 15:03:09 UTC
Created attachment 131982 [details] [review]
possible fix

This patch should fix the issue.
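
(The attachment itself is not reproduced here. For anyone following along, the general shape of a fix for this class of bug is to hand the interpreter a buffer large enough for the table's full PS usage instead of the bare parameter struct. A purely hypothetical sketch, with the padding size a guess:)

union {
        ENABLE_DISP_POWER_GATING_PARAMETERS_V2_1 v2_1;
        uint32_t ps_space[4];   /* room for ctx->ps[1] etc.; size is a guess */
} args;

memset(&args, 0, sizeof(args));
/* ... fill args.v2_1 exactly as the existing code fills args ... */
amdgpu_atom_execute_table(adev->mode_info.atom_context, index,
                          (uint32_t *)&args);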
Comment 11 Carlo Caione 2017-06-15 15:17:08 UTC
Yes, that solves the panic issue, thank you. I still have the corruption at boot though; any idea what that could be?
Comment 12 Carlo Caione 2017-06-15 15:32:59 UTC
Also (in case it is useful for the corruption issue):

00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 9874 (rev ca)
...
03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 699f (rev ff)

# echo 1 > /sys/bus/pci/devices/0000\:00\:01.0/rom
# cat /sys/bus/pci/devices/0000\:00\:01.0/rom > /tmp/vbios.rom
cat: /sys/bus/pci/devices/0000:00:01.0/rom: Input/output error

...

# echo 1 > /sys/bus/pci/devices/0000\:03\:00.0/rom
# cat /sys/bus/pci/devices/0000\:03\:00.0/rom > /tmp/vbios.rom
cat: /sys/bus/pci/devices/0000:03:00.0/rom: Input/output error
Comment 13 Alex Deucher 2017-06-15 15:40:15 UTC
(In reply to Carlo Caione from comment #12)
> Also (in case it is useful for the corruption issue):
> 

not likely related.

> 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> Device 9874 (rev ca)
> ...
> 03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device
> 699f (rev ff)
> 
> # echo 1 > /sys/bus/pci/devices/0000\:00\:01.0/rom
> # cat /sys/bus/pci/devices/0000\:00\:01.0/rom > /tmp/vbios.rom
> cat: /sys/bus/pci/devices/0000:00:01.0/rom: Input/output error

If this is a UEFI system, that may not work.
Comment 14 Carlo Caione 2017-06-15 16:35:01 UTC
> not likely related.

Alright, any hint on where to start looking? As I said before, it looks like the visual corruption at boot (this: https://pasteboard.co/qrC9mh4p.jpg) starts at:

[drm] amdgpu atom DIG backlight initialized
[drm] AMDGPU Display Connectors
[drm] Connector 0:
[drm]   eDP-1
[drm]   HPD1
[drm]   DDC: 0x4868 0x4868 0x4869 0x4869 0x486a 0x486a 0x486b 0x486b
[drm]   Encoders:
[drm]     LCD1: INTERNAL_UNIPHY
[drm] Connector 1:
[drm]   HDMI-A-1
[drm]   HPD2
[drm]   DDC: 0x486c 0x486c 0x486d 0x486d 0x486e 0x486e 0x486f 0x486f
[drm]   Encoders:
[drm]     DFP1: INTERNAL_UNIPHY
[drm] amdgpu atom DIG backlight initialized
[drm] AMDGPU Display Connectors
[drm] Connector 0:
[drm]   eDP-1
[drm]   HPD1
[drm]   DDC: 0x4868 0x4868 0x4869 0x4869 0x486a 0x486a 0x486b 0x486b
[drm]   Encoders:
[drm]     LCD1: INTERNAL_UNIPHY
[drm] Connector 1:
[drm]   HDMI-A-1
[drm]   HPD2
[drm]   DDC: 0x486c 0x486c 0x486d 0x486d 0x486e 0x486e 0x486f 0x486f
[drm]   Encoders:
[drm]     DFP1: INTERNAL_UNIPHY

Fortunately it disappears after ~2 seconds, when plymouth kicks in. My impression is that the visual corruption is triggered again while some atombios code is being interpreted, which is not easy to debug.
Comment 15 Alex Deucher 2017-06-15 16:59:47 UTC
(In reply to Carlo Caione from comment #14)
> 
> Fortunately it disappears after ~2 seconds when plymouth kicks in. My
> impression is that the visual corruption is triggered again when
> interpreting some atombios code, not really easy to debug.

Probably just intermediate garbage that happens to be visible while the memory controller or display hw is being reprogrammed.
Comment 16 Carlo Caione 2017-06-15 18:17:46 UTC
I see. The problem is that the garbage lines result in a very bad experience for the user. Do you have any idea how to fix / mitigate that?
Comment 17 Alex Deucher 2017-06-15 18:40:56 UTC
(In reply to Carlo Caione from comment #16)
> I see. The problem is that the garbage lines result in a very bad experience
> for the user. Do you have any idea how to fix / mitigate that?

If it's related to reprogramming the MC, this patch series may help:
https://lists.freedesktop.org/archives/amd-gfx/2017-February/005675.html
It hasn't landed yet due to regressions with VCE.
Comment 18 Alex Deucher 2017-06-15 21:51:08 UTC
The VCE regression is fixed.  If you want to try those patches, they are available here:
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.13-wip
Comment 19 Carlo Caione 2017-06-16 14:43:46 UTC
Those patches do not solve the issue, but I noticed that the visual corruption at boot is not reproducible on the latest Linus master. I bisected and found that the commit fixing the problem is https://patchwork.freedesktop.org/patch/151545/

So the last remaining problem to address is the scrambled colors in plymouth on reboot/poweroff.

Thanks,
Comment 20 Carlo Caione 2017-06-20 17:47:42 UTC
> That could be a plymouth bug. I've noticed that it doesn't seem to call
> drmModeCrtcSetGamma to make sure the display CLUTs are initialized to
> appropriate values.
Can you expand on this a little bit?

I have (by chance, TBH) fixed plymouth by reading the CLUTs with drmModeCrtcGetGamma() when plymouthd starts and writing them back with drmModeCrtcSetGamma() just before drmModeSetCrtc() is called, when the reboot splash has to be displayed (see the sketch below).
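
Roughly, the workaround amounts to this (illustrative sketch only; fd, crtc_id and LUT_SIZE are placeholders, error handling omitted):

#include <stdint.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

#define LUT_SIZE 256            /* placeholder; must match the CRTC's gamma size */

static uint16_t saved_r[LUT_SIZE], saved_g[LUT_SIZE], saved_b[LUT_SIZE];

/* at plymouthd startup: snapshot the current CLUT */
void save_gamma(int fd, uint32_t crtc_id)
{
        drmModeCrtcGetGamma(fd, crtc_id, LUT_SIZE, saved_r, saved_g, saved_b);
}

/* just before the drmModeSetCrtc() for the reboot splash: write it back */
void restore_gamma(int fd, uint32_t crtc_id)
{
        drmModeCrtcSetGamma(fd, crtc_id, LUT_SIZE, saved_r, saved_g, saved_b);
}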

But it's not clear to me why this is needed at all: in theory I'm writing back the same CLUT values that drmModeCrtcGetGamma() returned, yet if I don't do it, I get image corruption.

Any hint about this?
Comment 21 Michel Dänzer 2017-06-21 01:08:04 UTC
Plymouth shouldn't get the values with drmModeCrtcGetGamma — there's no guarantee that the CLUT contains suitable values at that point (e.g. the hardware could currently be programmed for a different colour bit depth in the first place). It should initialize a suitable (e.g. linear) gamma ramp and set that with drmModeCrtcSetGamma.
Comment 22 Carlo Caione 2017-06-21 06:26:10 UTC
> Plymouth shouldn't get the values with drmModeCrtcGetGamma — there's no
> guarantee that the CLUT contains suitable values at that point (e.g. the
> hardware could currently be programmed for a different colour bit depth
> in the first place).
Ok, but at this point I wonder (sorry, I'm not really familiar with libdrm) why the values reported by drmModeCrtcGetGamma are not consistent with the values currently programmed in the hardware. Is that a common thing?

If __immediately__ before drmModeSetCrtc I do drmModeCrtcGetGamma + drmModeCrtcSetGamma with the values retrieved, everything works fine with the correct colors. So the CLUT returned by drmModeCrtcGetGamma actually contains suitable/correct values, but apparently these differ from what is programmed in the hardware.
Comment 23 Michel Dänzer 2017-06-21 06:49:17 UTC
Sounds like there is also a bug somewhere in the kernel (driver), but even if that was fixed, it still wouldn't solve the plymouth issue in all cases.
Comment 24 Carlo Caione 2017-06-21 09:26:28 UTC
> Sounds like there is also a bug somewhere in the kernel (driver), but
> even if that was fixed, it still wouldn't solve the plymouth issue
> in all cases.
Yeah, so what would be the right way to proceed here? I imagine something like the following (rough sketch after the list):
- Retrieve the gamma size from the kernel with drmModeGetCrtc()
- Build a linear ramp using the correct size
- Set it with drmModeCrtcSetGamma()
- Call drmModeSetCrtc()
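
A minimal sketch of that sequence using libdrm (assumptions: fd, crtc_id, fb_id, connector_id and mode come from plymouth's usual KMS setup; error handling omitted):

#include <stdint.h>
#include <stdlib.h>
#include <xf86drm.h>
#include <xf86drmMode.h>

int set_linear_gamma_and_modeset(int fd, uint32_t crtc_id, uint32_t fb_id,
                                 uint32_t connector_id, drmModeModeInfo *mode)
{
        drmModeCrtcPtr crtc = drmModeGetCrtc(fd, crtc_id);
        int size = crtc ? crtc->gamma_size : 256;               /* 1) gamma size */
        uint16_t *r = calloc(size, sizeof(uint16_t));
        uint16_t *g = calloc(size, sizeof(uint16_t));
        uint16_t *b = calloc(size, sizeof(uint16_t));
        int i, ret;

        for (i = 0; i < size; i++)                              /* 2) linear ramp */
                r[i] = g[i] = b[i] = (uint16_t)(i * 0xffff / (size - 1));

        drmModeCrtcSetGamma(fd, crtc_id, size, r, g, b);        /* 3) set it */

        ret = drmModeSetCrtc(fd, crtc_id, fb_id, 0, 0,          /* 4) modeset */
                             &connector_id, 1, mode);

        free(r); free(g); free(b);
        if (crtc)
                drmModeFreeCrtc(crtc);
        return ret;
}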
Comment 25 Michel Dänzer 2017-06-22 08:27:33 UTC
(In reply to Carlo Caione from comment #24)
> - Retrieve the gamma size from the kernel with drmModeGetCrtc()

This could be wrong if the currently set pixel format uses a different LUT size. Plymouth can hardcode the LUT size according to the pixel format it uses.

The rest of your sequence looks good to me, but you might want to have it double-checked by a KMS expert.
Comment 26 Martin Peres 2019-11-19 08:19:14 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/190.
