Summary: amdgpu display corruption and hang on AMD A10-9620P
Product: DRI
Component: DRM/AMDgpu
Status: RESOLVED MOVED
Severity: major
Priority: medium
Version: XOrg git
Hardware: Other
OS: All
Reporter: Carlo Caione <carlo>
Assignee: Default DRI bug account <dri-devel>
Description
Carlo Caione
2017-06-12 10:21:41 UTC
Created attachment 131890 [details]
log no stack protection
Just an update on this. If I recompile the kernel selecting CONFIG_CC_STACKPROTECTOR_NONE (i.e. with the stack protector disabled) I don't have the panic anymore (of course), but I can still see the corruption at boot (and weird artifacts in plymouth). The logs are in the attachment.
Looks like amdgpu_atombios_crtc_powergate_init scribbles over the stack.

(In reply to Carlo Caione from comment #1)
> If I recompile the kernel forcing CONFIG_CC_STACKPROTECTOR_NONE == none I don't
> have the panic anymore (of course) but I can still see the corruption at boot
> (and weird artifacts on plymouth).

What kind of artifacts in plymouth? Does e.g. X work correctly?

> Looks like amdgpu_atombios_crtc_powergate_init scribbles over the stack.

Yeah, and while this is what causes the panic, it is not related to the display corruption. The garbage lines we see at boot seem to be displayed at the same time as this in the log (the whole block appears twice):

    [drm] amdgpu atom DIG backlight initialized
    [drm] AMDGPU Display Connectors
    [drm] Connector 0:
    [drm]   eDP-1
    [drm]   HPD1
    [drm]   DDC: 0x4868 0x4868 0x4869 0x4869 0x486a 0x486a 0x486b 0x486b
    [drm]   Encoders:
    [drm]     LCD1: INTERNAL_UNIPHY
    [drm] Connector 1:
    [drm]   HDMI-A-1
    [drm]   HPD2
    [drm]   DDC: 0x486c 0x486c 0x486d 0x486d 0x486e 0x486e 0x486f 0x486f
    [drm]   Encoders:
    [drm]     DFP1: INTERNAL_UNIPHY

> What kind of artifacts in plymouth?

https://pasteboard.co/hWu18yPj0.jpg

> Does e.g. X work correctly?

Yes, with stack protection disabled Xorg works fine (no panic, but I still have the display corruption at boot and on poweroff/reboot).

(In reply to Carlo Caione from comment #3)
> > What kind of artifacts in plymouth?
> https://pasteboard.co/hWu18yPj0.jpg

That could be a plymouth bug. I've noticed that it doesn't seem to call drmModeCrtcSetGamma to make sure the display CLUTs are initialized to appropriate values.

> That could be a plymouth bug. I've noticed that it doesn't seem to call
> drmModeCrtcSetGamma to make sure the display CLUTs are initialized to
> appropriate values.
Yeah, it could be, I'll investigate the plymouth issue later, right now the kernel panic situation is the top priority on my list :D
Any idea what's triggering the stack corruption?

I'm still investigating, but right now my bet is on the cast of args to (uint32_t *) in:

    amdgpu_atom_execute_table(adev->mode_info.atom_context, index, (uint32_t *)&args);
Uhm, probably I have found something. In amdgpu_atombios_crtc_powergate_init() we declare

    ENABLE_DISP_POWER_GATING_PARAMETERS_V2_1 args;

so args is basically a 4-byte (32-bit) struct. We pass this struct down to amdgpu_atom_execute_table(), casting it to (uint32_t *). This address is then assigned to (uint32_t *) ectx.ps in amdgpu_atom_execute_table_locked(). At a certain point during the execution of the code in the table with index == 75, atom_put_dst() is called with argument ATOM_ARG_PS and idx == 1, so we do:

    ctx->ps[idx] = cpu_to_le32(val);

but with idx == 1 we are accessing past the boundary of args, triggering the stack corruption.

Is this analysis correct, and if it is, how can we fix it? Just to be clear: I have already verified that the addresses of &args and ctx->ps are the same, so this is definitely something we want to fix somehow.

A better description of what's going on, plus a couple of questions. When amdgpu_atombios_crtc_powergate_init() is called, it triggers the parsing of the command table with index == 13 (>> execute C5C0 (len 589, WS 0, PS 0)). As already reported, the parameter space used (struct ENABLE_DISP_POWER_GATING_PARAMETERS_V2_1) is only 4 bytes wide. During the execution of this table several CALL_TABLEs (op == 82) are executed: we first jump to the table with index == 78 (>> execute F166 (len 588, WS 0, PS 8)), then to the table with index == 51 (>> execute F446 (len 465, WS 4, PS 4)) and finally to the table with index == 75 (>> execute F6CC (len 1330, WS 4, PS 0)) before finally reaching the EOT of table 13. During the execution of table 75 a MOVE_PS is executed with a destination index == 1, accessing ctx->ps[idx] and causing the stack corruption. So either the atombios code is wrong, or the atombios interpreter in the kernel is doing something wrong.
I also have a couple of questions / observations:

1) Table 75 has WS == 4 and PS == 0, and looking at the opcodes in the table I basically see only *_WS opcodes (MOVE_WS, TEST_WS, ADD_WS, etc.) and just two *_PS instructions (MOVE_PS and OR_PS) that (guess what) are the instructions causing the stack corruption. My guess here is that those *_PS opcodes in the atombios are wrong and should actually be *_WS opcodes.

2) Don't we need to allocate the parameter-space struct for the command table we are going to execute after a CALL_TABLE, matching the PS size in the table header? IIUC the code in the kernel, ctx->ps is not reallocated when we jump to a different table.

3) Could point (2) also be a problem in our case? Assuming that the PS value read from the table header has something to do with the size of the parameter space (guessing here), table 13 has PS == 0 while table 75 has PS == 4, yet both use the same ctx->ps.

Can you please attach a copy of your vbios? (as root; use lspci to get the bus id)

    cd /sys/bus/pci/devices/<pci bus id>
    echo 1 > rom
    cat rom > /tmp/vbios.rom
    echo 0 > rom

Created attachment 131982 [details] [review]
possible fix

This patch should fix the issue.

Yes, that solves the panic issue, thank you. I still have the corruption at boot though; any idea what that can be? Also (in case it is useful for the corruption issue):

    00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 9874 (rev ca)
    ...
    03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device 699f (rev ff)

    # echo 1 > /sys/bus/pci/devices/0000\:00\:01.0/rom
    # cat /sys/bus/pci/devices/0000\:00\:01.0/rom > /tmp/vbios.rom
    cat: /sys/bus/pci/devices/0000:00:01.0/rom: Input/output error
    ...
    # echo 1 > /sys/bus/pci/devices/0000\:03\:00.0/rom
    # cat /sys/bus/pci/devices/0000\:03\:00.0/rom > /tmp/vbios.rom
    cat: /sys/bus/pci/devices/0000:03:00.0/rom: Input/output error

(In reply to Carlo Caione from comment #12)
> Also (in case it is useful for the corruption issue):

Not likely related.

> 00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI]
> Device 9874 (rev ca)
> ...
> 03:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Device
> 699f (rev ff)
>
> # echo 1 > /sys/bus/pci/devices/0000\:00\:01.0/rom
> # cat /sys/bus/pci/devices/0000\:00\:01.0/rom > /tmp/vbios.rom
> cat: /sys/bus/pci/devices/0000:00:01.0/rom: Input/output error

If this is a UEFI system, that may not work.

> not likely related.

Alright, any hint where to start looking? As I said before, it looks like the visual corruption (this: https://pasteboard.co/qrC9mh4p.jpg) at boot starts at the same "[drm] AMDGPU Display Connectors" log block quoted earlier in the thread (again printed twice, from "[drm] amdgpu atom DIG backlight initialized" through "[drm] DFP1: INTERNAL_UNIPHY"). Fortunately it disappears after ~2 seconds when plymouth kicks in. My impression is that the visual corruption is triggered again when interpreting some atombios code; not really easy to debug.

(In reply to Carlo Caione from comment #14)
> Fortunately it disappears after ~2 seconds when plymouth kicks in. My
> impression is that the visual corruption is triggered again when
> interpreting some atombios code, not really easy to debug.

Probably just intermediate garbage that happens to be visible while the memory controller or display hw is being reprogrammed.

I see. The problem is that the garbage lines result in a very bad experience for the user. Do you have any idea how to fix / mitigate that?

(In reply to Carlo Caione from comment #16)
> I see. The problem is that the garbage lines result in a very bad experience
> for the user. Do you have any idea how to fix / mitigate that?

If it's related to reprogramming the MC, this patch series may help:
https://lists.freedesktop.org/archives/amd-gfx/2017-February/005675.html
It hasn't landed yet due to regressions with VCE.

The VCE regression is fixed. If you want to try those patches, they are available here: https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-next-4.13-wip

Those patches do not solve the issue, but I noticed that the visual corruption at boot is not reproducible on the latest Linus master. I bisected and found that the commit solving the problem is https://patchwork.freedesktop.org/patch/151545/

So the last problem still to be addressed is the scrambled colors in plymouth on reboot / poweroff. Thanks.

> That could be a plymouth bug. I've noticed that it doesn't seem to call
> drmModeCrtcSetGamma to make sure the display CLUTs are initialized to
> appropriate values.
Can you expand on this a little bit?
I have (by chance, TBH) fixed plymouth by reading the CLUTs with drmModeCrtcGetGamma() when plymouthd starts and writing them back with drmModeCrtcSetGamma() just before drmModeSetCrtc() is called, when the reboot splash has to be displayed.
But it's not clear to me why this is needed at all: in theory I'm writing back the same CLUT values that drmModeCrtcGetGamma() read, yet if I don't do it I get image corruption.
Any hint about this?
Plymouth shouldn't get the values with drmModeCrtcGetGamma — there's no guarantee that the CLUT contains suitable values at that point (e.g. the hardware could currently be programmed for a different colour bit depth in the first place). It should initialize a suitable (e.g. linear) gamma ramp and set that with drmModeCrtcSetGamma.

> Plymouth shouldn't get the values with drmModeCrtcGetGamma — there's no
> guarantee that the CLUT contains suitable values at that point (e.g. the
> hardware could currently be programmed for a different colour bit depth
> in the first place).
Ok, but at this point I wonder (sorry, not really familiar with libdrm) why the values reported by drmModeCrtcGetGamma are not coherent with the values currently programmed in the hardware. Is this a common thing?
If __immediately__ before drmModeSetCrtc I do drmModeCrtcGetGamma + drmModeCrtcSetGamma with the retrieved values, everything works fine with the correct colors. So the CLUTs returned by drmModeCrtcGetGamma actually contain suitable / correct values, but apparently these differ from what is programmed in the hardware.
Sounds like there is also a bug somewhere in the kernel (driver), but even if that was fixed, it still wouldn't solve the plymouth issue in all cases.

> Sounds like there is also a bug somewhere in the kernel (driver), but
> even if that was fixed, it still wouldn't solve the plymouth issue
> in all cases.
Yeah, so what would be the right way of proceeding here? I imagine something like:
- Retrieve the gamma size from the kernel with drmModeGetCrtc()
- Build a linear ramp using the correct size
- Set it with drmModeCrtcSetGamma()
- Call drmModeSetCrtc()
(In reply to Carlo Caione from comment #24)
> - Retrieve the gamma size from the kernel with drmModeGetCrtc()

This could be wrong if the currently set pixel format uses a different LUT size. Plymouth can hardcode the LUT size according to the pixel format it uses.

The rest of your sequence looks good to me, but you might want to have it double-checked by a KMS expert.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/190.