Created attachment 139081 [details] journalctl log file On a new Ryzen 5 2400G build, I am using linux-4.16.3 (Arch Linux). On some boots (~1 out of every 3) I get a kernel panic after modesetting occurs. I have attached the relevant systemd log file (for the entire boot process). I am not sure which component is failing, but I suspect AMDGPU, since it happens right after modesetting. If it matters, the mainboard is a Gigabyte "AB350N-Gaming Wifi", with the latest available BIOS as of writing (BIOS F23d 04/17/2018). Please advise if more information is required.
The journalctl log file shows oopses from the evdev driver, no obvious amdgpu related issues. Can you attach a picture of the output from the actual kernel panic?
Created attachment 139114 [details] journalctl log file Apologies, as this was the wrong log file (I was getting some journalctl corruption). But it seems you are correct and this is not a kernel panic, as I can still reboot using the "Magic SysRq key" (which I usually can't during kernel panics), although the "Caps Lock" key is unresponsive. Attached is a log that should show the issue. Note the line:
abr 25 23:23:09 ZenBox systemd-udevd[297]: worker [308] failed while handling '/devices/pci0000:00/0000:00:08.1/0000:09:00.0'
Where '/devices/pci0000:00/0000:00:08.1/0000:09:00.0' seems to be the "graphics card". Symptoms are the same. During some boots, modesetting "seems" to occur, but I get a black screen instead. The system is unresponsive, including the "Caps Lock" light not changing on key press. The Magic SysRq key does seem to successfully reboot the system. How should I edit the title to correctly reflect this?
There's a general protection fault within kmem_cache_alloc_trace when dcn10_create_resource_pool calls kzalloc (which looks innocuous). There's another general protection fault in kmem_cache_alloc_trace later, called from cgroup code. Looks like there might be a general memory management related issue.
Maybe you can try enabling KASAN and see if that catches anything earlier.
Created attachment 139154 [details] journalctl log file with KASAN enabled in kernel I have compiled a new kernel with KASAN enabled. However, I am not sure I am getting any KASAN related output in the logs (attached). Is there any boot option I should be passing to it? Alternatively, can you point me towards a good source of documentation on using KASAN? (It is the first time I am trying this.)
Looks like KASAN isn't enabled yet — the lines in dmesg containing "PREEMPT SMP NOPTI" should contain "KASAN" as well. FWIW, I just enable CONFIG_KASAN and CONFIG_KASAN_INLINE in .config, I don't have to do anything at runtime to enable it.
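Before rebuilding, it can help to confirm that the options really ended up in the .config that was used. The following is a minimal sketch, assuming the kernel .config sits in the current directory; `config_enabled` is a hypothetical helper written for this illustration, not something shipped with the kernel tree.

```shell
#!/bin/sh
# config_enabled is a hypothetical helper (not part of the kernel tree).
# It succeeds if the given option is built in (=y) in a kernel .config file.
config_enabled() {
    # $1 = path to the .config file, $2 = option name without the CONFIG_ prefix
    grep -q "^CONFIG_$2=y\$" "$1"
}

# Assumption: the .config of the kernel being built is in the current directory.
if [ -f .config ]; then
    for opt in KASAN KASAN_INLINE KASAN_OUTLINE; do
        if config_enabled .config "$opt"; then
            echo "CONFIG_$opt is enabled"
        else
            echo "CONFIG_$opt is not enabled"
        fi
    done
fi
```

The pattern is anchored at both ends, so a check for KASAN does not accidentally match KASAN_INLINE or KASAN_OUTLINE.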
I had previously made a mistake in loading the kernel's .config file. I have now managed to compile the kernel with the options for KASAN set. However, booting this kernel results in an instant reboot after displaying the message "Loading initial ramdisk...". I will try to compile another kernel using CONFIG_KASAN_OUTLINE this time and see if I have better luck.
Created attachment 139178 [details] journalctl log with KASAN_OUTLINE Ok, so using CONFIG_KASAN_OUTLINE works. I get a KASAN enabled kernel, which boots every single time. I cannot reproduce the error with KASAN enabled after more than 15 reboots (I lost count after that). Using the "regular" kernel keeps giving me the error ~1 out of 3 boots. I have attached a KASAN enabled log, but I am not sure how useful it might be.
Getting closer, please try again with kasan_multi_shot on the kernel command line, otherwise KASAN only reports the first thing it catches.
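To verify after rebooting that the parameter actually reached the kernel, the running command line can be inspected. This is only a sketch: `has_boot_param` is a hypothetical helper, and it assumes the command line is readable from /proc/cmdline (how the parameter gets added depends on the boot loader in use).

```shell
#!/bin/sh
# has_boot_param is a hypothetical helper. It succeeds if the given parameter
# appears as a whole word on the kernel command line, either bare or in the
# form param=value. The second argument defaults to /proc/cmdline.
has_boot_param() {
    param=$1
    file=${2:-/proc/cmdline}
    # Put every word on its own line, then require a whole-line match.
    tr -s ' \t' '\n\n' < "$file" | grep -q -x -e "$param" -e "$param=.*"
}

if [ -r /proc/cmdline ]; then
    if has_boot_param kasan_multi_shot; then
        echo "kasan_multi_shot is set"
    else
        echo "kasan_multi_shot is not set"
    fi
fi
```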
Created attachment 139181 [details] journalctl log with KASAN_OUTLINE and kasan_multi_shot Bingo. Now here is the thing... With the KASAN enabled kernel and the multi_shot option set, I can **never** boot successfully. In fact, even modesetting is not happening. I am uploading 2 log files, since on one of the occasions, I got a black screen when I was expecting modesetting to happen. On the other log, I got a non-modeset text screen with the KASAN dumps. I hope this is what was needed.
Please provide the output of the following in your kernel build tree: scripts/faddr2line drivers/gpu/drm/amd/amdgpu/amdgpu.ko firmware_parser_create+0xa70/0xd90
francisco@ZenBox [18:03:53] [/usr/lib/modules/4.16.5-1-kasan/build]
-> $ scripts/faddr2line drivers/gpu/drm/amd/amdgpu/amdgpu.ko firmware_parser_create+0xa70/0xd90
ERROR: can't find objfile drivers/gpu/drm/amd/amdgpu/amdgpu.ko

The directory "drivers/gpu/drm/amd/amdgpu/" contains only a file named "Kconfig", containing the following:
```
config DRM_AMDGPU_SI
	bool "Enable amdgpu support for SI parts"
	depends on DRM_AMDGPU
	help
	  Choose this option if you want to enable experimental support
	  for SI asics. SI is already supported in radeon. Experimental
	  support for SI in amdgpu will be disabled by default and is
	  still provided by radeon. Use module options to override this:
	  radeon.si_support=0 amdgpu.si_support=1

config DRM_AMDGPU_CIK
	bool "Enable amdgpu support for CIK parts"
	depends on DRM_AMDGPU
	help
	  Choose this option if you want to enable support for CIK asics.
	  CIK is already supported in radeon. Support for CIK in amdgpu
	  will be disabled by default and is still provided by radeon.
	  Use module options to override this:
	  radeon.cik_support=0 amdgpu.cik_support=1

config DRM_AMDGPU_USERPTR
	bool "Always enable userptr write support"
	depends on DRM_AMDGPU
	select MMU_NOTIFIER
	help
	  This option selects CONFIG_MMU_NOTIFIER if it isn't already
	  selected to enabled full userptr support.

config DRM_AMDGPU_GART_DEBUGFS
	bool "Allow GART access through debugfs"
	depends on DRM_AMDGPU
	depends on DEBUG_FS
	default n
	help
	  Selecting this option creates a debugfs file to inspect the mapped
	  pages. Uses more memory for housekeeping, enable only for debugging.

source "drivers/gpu/drm/amd/acp/Kconfig"
source "drivers/gpu/drm/amd/display/Kconfig"
```
I did find amdgpu.ko.xz under "/usr/lib/modules/4.16.5-1-kasan/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko.xz", which I have decompressed using "xz -k -d".

Running the command you requested resulted in the following output:

francisco@ZenBox [18:12:53] [/usr/lib/modules/4.16.5-1-kasan/build]
-> $ scripts/faddr2line ../kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko firmware_parser_create+0xa70/0xd90
firmware_parser_create+0xa70/0xd90:
firmware_parser_create at ??:?

PS - Thank you for your patience
(In reply to Francisco Pina Martins from comment #12)
> francisco@ZenBox [18:12:53] [/usr/lib/modules/4.16.5-1-kasan/build]
> -> $ scripts/faddr2line ../kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko
> firmware_parser_create+0xa70/0xd90
> firmware_parser_create+0xa70/0xd90:
> firmware_parser_create at ??:?

What does

 file ../kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko

say? If its output doesn't say "not stripped", look for an unstripped version of the module. If the file output for that doesn't say "with debug_info", try enabling CONFIG_DEBUG_INFO (and maybe CONFIG_DEBUG_INFO_REDUCED).

If that still doesn't result in better output than ??:? from faddr2line, the best guess so far is that there's an issue somewhere in drivers/gpu/drm/amd/display/dc/bios/bios_parser2.c:bios_parser_construct.
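The decision steps above can be sketched as a small classifier over the text printed by file(1). This is only an illustration: `module_debug_state` is a hypothetical helper, and it relies solely on the two phrases quoted above ("not stripped" and "with debug_info").

```shell
#!/bin/sh
# module_debug_state is a hypothetical helper that classifies the output of
# file(1) for a kernel module into one of three states.
module_debug_state() {
    case $1 in
        *"with debug_info"*) echo "debug_info" ;;  # usable by faddr2line
        *"not stripped"*)    echo "unstripped" ;;  # symbols only, no line info
        *)                   echo "stripped" ;;    # look for another copy
    esac
}

# Sample input: the file(1) output reported in this thread for the KASAN build.
module_debug_state 'ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=fa6433331b1f8048ce1c9b487d7b67d7a4aa4c31, not stripped'
# prints "unstripped"
```

For a real module, the argument would come from something like `module_debug_state "$(file path/to/amdgpu.ko)"`, where the path is whatever location the module was installed or built in.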
francisco@ZenBox [10:31:49] [~] -> $ file /usr/lib/modules/4.16.5-1-kasan/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko /usr/lib/modules/4.16.5-1-kasan/kernel/drivers/gpu/drm/amd/amdgpu/amdgpu.ko: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), BuildID[sha1]=fa6433331b1f8048ce1c9b487d7b67d7a4aa4c31, not stripped It does not say "with debug_info", so I'm compiling a new kernel with CONFIG_DEBUG_INFO and CONFIG_DEBUG_INFO_REDUCED activated. I will post the results as soon as I have them.
Here you go:

francisco@ZenBox [11:13:37] [/tmp/build/linux-kasan/src/linux-4.16]
-> $ scripts/faddr2line drivers/gpu/drm/amd/amdgpu/amdgpu.ko firmware_parser_create+0xa70/0xd90
firmware_parser_create+0xa70/0xd90:
get_integrated_info_v11 at /tmp/build/linux-kasan/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1572
(inlined by) construct_integrated_info at /tmp/build/linux-kasan/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1714
(inlined by) bios_parser_create_integrated_info at /tmp/build/linux-kasan/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1755
(inlined by) bios_parser_construct at /tmp/build/linux-kasan/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1912
(inlined by) firmware_parser_create at /tmp/build/linux-kasan/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1927

Is this it?
That does look helpful, thanks! I'll leave it to the DC folks to take it from here. Francisco, just one more thing: Can you double-check that KASAN still reports firmware_parser_create+0xa70/0xd90 with the current amdgpu.ko, or otherwise pass the current values to faddr2line?
I don't think I'll be able to test that, since the kernel with debug enabled weighs in at 590 MB, which is larger than my 512 MB /boot partition. Will it work if I just replace the amdgpu.ko file with the debug-enabled version?
(In reply to Francisco Pina Martins from comment #17)
> Will it work if I just replace the amdgpu.ko file with the debug-enabled
> version?

Yeah, that could work. If it doesn't, then it's not a big deal, it's unlikely that the value before the slash has changed but the one after the slash hasn't (if the latter had changed, faddr2line should have complained). Just double-checking.
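For readers unfamiliar with the notation: in `symbol+offset/size`, the value before the slash is the offset into the function and the value after it is the total size of the function. A minimal sketch of splitting such a string follows; `split_func_addr` is a hypothetical helper made up for this illustration.

```shell
#!/bin/sh
# split_func_addr is a hypothetical helper that splits the
# symbol+offset/size notation into its three parts.
split_func_addr() {
    addr=$1
    sym=${addr%%+*}      # text before the first '+': the function name
    rest=${addr#*+}      # text after the first '+'
    offset=${rest%%/*}   # value before the slash: offset into the function
    size=${rest#*/}      # value after the slash: size of the function
    echo "$sym $offset $size"
}

split_func_addr 'firmware_parser_create+0xa70/0xd90'
# prints "firmware_parser_create 0xa70 0xd90"
```

If a rebuilt module reports a different size for the same function, addresses from older logs most likely no longer apply to it.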
Created attachment 139239 [details] journalctl log with KASAN_OUTLINE and kasan_multi_shot using amdgpu.ko compiled with debug info Here you go. journalctl log after booting with kasan_multi_shot and using the `amdgpu.ko` file compiled with debug info. Now this is a Frankenkernel monster if I ever saw one. I have taken the liberty of "grepping" for the pattern, and it seems that "firmware_parser_create+0xa70/0xd90" is still there. Thank you for passing this along to the DC folk. Please let me know when a patch I can test is made available. Also, if you need any more information (or try something), just ask away. At the very least, with this bug report I have discovered that the KASAN enabled kernel boots every time, at the expense of an extra second while booting. I did not notice any other performance differences between the KASAN enabled and the stock kernel (was I supposed to?). Once again, thank you, Michael for your patience guiding me through the tasks. Next time will be easier. _-)
Now I notice there's another report for firmware_parser_create+0xa9b/0xd90. What does faddr2line say for that? Though it's weird that KASAN claims the memory written in firmware_parser_create was freed from rcu_cpu_kthread. If that's accurate[0], it might indicate a lower level issue. [0] There is some doubt about that due to the "Frankenkernel monster". :) Does enabling CONFIG_DEBUG_INFO_REDUCED as well allow you to keep debugging symbols for everything, or at least for the vmlinuz image?
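Since more than one offset into the same function can show up in the KASAN reports, it may be easier to collect all of them from the log in one pass. The following is a rough sketch, assuming the journalctl output was saved to a plain-text file; `list_func_addrs` is a hypothetical helper and it relies on `grep -o`, which GNU and BSD grep both provide.

```shell
#!/bin/sh
# list_func_addrs is a hypothetical helper: it prints each distinct
# symbol+offset/size entry for one function found in a saved log file.
list_func_addrs() {
    # $1 = function name, $2 = path to the log file
    grep -o "$1+0x[0-9a-f]*/0x[0-9a-f]*" "$2" | sort -u
}

# Example call; boot.log is a made-up file name for a saved journalctl log.
if [ -f boot.log ]; then
    list_func_addrs firmware_parser_create boot.log
fi
```

Each line of the output can then be passed to scripts/faddr2line together with the module, as in the commands used earlier in this report.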
Created attachment 139288 [details] journalctl log with KASAN_OUTLINE and kasan_multi_shot using a kernel compiled with debug info

Ok, so here's what I did. Since I did not have space for the kernel with both KASAN and debug info enabled (the 590 MB was already with CONFIG_DEBUG_INFO_REDUCED), I got my hands dirty and used nconfig to strip a ton of drivers I was pretty sure I didn't need from the build (stuff like nouveau, industrial controllers, game-pads, etc.). I got the kernel down to ~390 MB, which was enough to install. I have attached the journalctl log file with this new kernel (linux-kasan-debug-stripped). As for faddr2line output, here is the original command with the new kernel:
```
francisco@ZenBox [23:09:51] [/usr/lib/modules/4.16.5-1-kasan-debug-stripped/kernel/drivers/gpu/drm/amd/amdgpu]
-> $ /usr/lib/modules/4.16.5-1-kasan-debug-stripped/build/scripts/faddr2line amdgpu.ko firmware_parser_create+0xa70/0xd90
firmware_parser_create+0xa70/0xd90:
get_integrated_info_v11 at /tmp/build/linux-kasan-debug-stripped/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1572
(inlined by) construct_integrated_info at /tmp/build/linux-kasan-debug-stripped/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1714
(inlined by) bios_parser_create_integrated_info at /tmp/build/linux-kasan-debug-stripped/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1755
(inlined by) bios_parser_construct at /tmp/build/linux-kasan-debug-stripped/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1912
(inlined by) firmware_parser_create at /tmp/build/linux-kasan-debug-stripped/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1927
```
And here is the "new" command with the new kernel:
```
francisco@ZenBox [23:10:27] [/usr/lib/modules/4.16.5-1-kasan-debug-stripped/kernel/drivers/gpu/drm/amd/amdgpu]
-> $ /usr/lib/modules/4.16.5-1-kasan-debug-stripped/build/scripts/faddr2line amdgpu.ko firmware_parser_create+0xa9b/0xd90
firmware_parser_create+0xa9b/0xd90:
get_integrated_info_v11 at /tmp/build/linux-kasan-debug-stripped/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1574
(inlined by) construct_integrated_info at /tmp/build/linux-kasan-debug-stripped/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1714
(inlined by) bios_parser_create_integrated_info at /tmp/build/linux-kasan-debug-stripped/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1755
(inlined by) bios_parser_construct at /tmp/build/linux-kasan-debug-stripped/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1912
(inlined by) firmware_parser_create at /tmp/build/linux-kasan-debug-stripped/src/linux-4.16/drivers/gpu/drm/amd/amdgpu/../display/dc/bios/bios_parser2.c:1927
```
This is no longer a "franken-kernel-monster". Does this output reveal anything new?
It still says the memory is freed from rcu_cpu_kthread. Weird. It's not clear this actually is an amdgpu issue.
This is a bit out of my league... but should this issue be filed somewhere else then? If yes, where, and what information should I provide? Also, is there a possibility that there are multiple issues at play here?
(In reply to Francisco Pina Martins from comment #23)
> [...] should this issue be filed somewhere else then?
> If yes, where,

I'm not sure. :( Maybe start with memory management (https://www.linux-mm.org/, https://bugzilla.kernel.org/describecomponents.cgi?product=Memory%20Management)

> and what information should I provide?

At least all the same information as here, I guess.

> Also, is there a possibility that there are multiple issues at play here?

Quite possibly.
I have submitted the bug to the kernel Bugzilla as you suggested: https://bugzilla.kernel.org/show_bug.cgi?id=199613 Is there anything else I can do here to help track down a possible problem with AMDGPU? Or do you think the AMDGPU memory problem is being caused by `rcu_cpu_kthread`? Is there anything else you would recommend I do to disentangle what may be multiple issues?
(In reply to Francisco Pina Martins from comment #25)
> Is there anything else I can do here to help track the eventual problem with
> AMDGPU? Or do you think the AMDGPU memory problem is being caused by
> `rcu_cpu_kthread`?

That's what it looks like from the KASAN output.
The issue can be reproduced on 4.16.3, but not on 4.16-rc7. I've verified commit a0f282dcdb1775cbcc0a151570fc01c0aae5ca0f (current top) on amd-staging-drm-next and did not see the issue on Raven in 20 boots. Please give that commit a try on your setup. Thanks.
Currently "amd-staging-drm-next" fails to build for me, with the following error: ``` ../lib/str_error_r.c: In function ‘str_error_r’: ../lib/str_error_r.c:25:3: error: passing argument 1 to restrict-qualified parameter aliases with argument 5 [-Werror=restrict] snprintf(buf, buflen, "INTERNAL ERROR: strerror_r(%d, %p, %zd)=%d", errnum, buf, buflen, err); ^~~~~~~~ ``` From what I was able to research, it seems to be missing a patch that was applied in March (https://patchwork.kernel.org/patch/10291671/). But maybe I'm doing something wrong here, since I'm not very experienced. Or did you mean for me to try building linux-4.16, and patch it with commit "a0f282dcdb1775cbcc0a151570fc01c0aae5ca0f" from the "amd-staging-drm-next" tree?
Created attachment 139779 [details] journalctl log file for linux-amd-staging-drm-next I was able to compile the "amd-staging-drm-next" tree with the help of a small patch (https://github.com/StuntsPT/linux-amd-staging-drm-next-git/blob/master/PKGBUILD#L50) as of commit "46c04bb3e028217255b578cc6101823e9fbc11bc". However, using this kernel I can never get a successful boot (10/10 failures). I have attached the journalctl log for one of these failed boots. I hope this helps.
The commit on amd-staging-drm-next I checked out for verification is:

Author: Shaoyun Liu <Shaoyun.Liu@amd.com>
AuthorDate: Tue May 22 11:45:41 2018 -0400
Commit: Alex Deucher <alexander.deucher@amd.com>
CommitDate: Thu May 24 10:28:35 2018 -0500

drm/amdgpu: Update GFX info structure to match what vega20 used

Update to the latest version from the vbios team.

Signed-off-by: Shaoyun Liu <Shaoyun.Liu@amd.com>
Acked-by: Alex Deucher <alexander.deucher@amd.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

Checking out the commit directly and building it worked for me. It is a 4.16-rc7 build.
I have used [this PKGBUILD](https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=linux-amd-staging-drm-next-git) to build the kernel, although judging by the version number it looks more like linux-4.17 than 4.16. The commit I checked out was 46c04bb3e028217255b578cc6101823e9fbc11bc. I will investigate further and try checking out commit a0f282dcdb1775cbcc0a151570fc01c0aae5ca0f to see if there is any other difference in the build. The fact that you did not require the patch makes me think I'm doing something fundamentally different from you.
Happens to me too (kernel 4.17, Ubuntu 18.04). Especially on the first boot of the day, I always get a black screen after modeset. I haven't tried with KASAN and can't get any log, but I suspect it is the same issue.

Asrock ab350m pro4 - firmware 4.7
PSU seasonic s12ii
ram 2x8 gb kingston hyperx fury 2666
CPU ryzen 2200g
Created attachment 140596 [details] relevant kernel 4.17.5 log of the oops
Created attachment 140597 [details] relevant kernel 4.17.5 source of the oops
Sorry for the spam, I thought the two attachments would be inserted in the same comment. I think I'm afflicted by the same bug on different HW (Asrock AB350 ixt, Ryzen 5 2400G), Debian Stretch running vanilla kernel 4.17.5. I would say I get a successful boot in 1 of 4~5 attempts. Feel free to request other info.
With recent development kernels from https://cgit.freedesktop.org/~agd5f/ (drm-next-4.19-wip, I think commit ddf74e79a54070f277ae520722d3bab7f7a6c67a) I can consistently complete a cold/warm boot on my 2400G; before, it was 1 in 4~5 attempts. Probably unrelated to the above, I still have various asserts with stack traces in the logs from the "write_i2c_retimer_setting" function in "drivers/gpu/drm/amd/display/dc/core/dc_link.c". They all seem to be write failures, but they don't seem fatal.
Confirming that the issue seems to be solved with mainline linux-4.19-rc[1,2].
(In reply to Francisco Pina Martins from comment #37)
> Confirming that the issue seems to be solved with mainline
> linux-4.19-rc[1,2].

I'm glad to hear that! Resolving accordingly.