Haven't been able to boot agd5f kernels for a while, did a big bisect today. It came up with this commit:

commit 6f0359ff73076483902de0c17f9649bf55651e2a
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Wed Aug 24 17:15:33 2016 -0400

    drm/amdgpu/vce3: add support for third vce ring

    Not of much use at the moment (we don't really use the second ring
    either), but may be useful later.

    Reviewed-by: JimQu <Jim.Qu@amd.com>
    Reviewed-by: Christian König <christian.koenig@amd.com>
    Signed-off-by: Alex Deucher <alexander.deucher@amd.com>

[ 16.822] (--) AMDGPU(0): Chipset: "FIJI" (ChipID = 0x7300)
The patch reverts cleanly on drm-next-4.9-wip and makes that branch bootable also.
Wait, now I'm not so sure about the bisect. Could there be a cold/warm boot thing playing tricks on me? Now I couldn't boot even with the patch reverted. But if I reboot from a stable kernel quickly, I can boot...
This seems to be very hard to reproduce. Perhaps it's related to the kernel I'm rebooting from or some complex combination. Pretty hard to debug when the computer simply stops responding without any text etc.
(In reply to Ernst Sjöstrand from comment #3)
> This seems to be very hard to reproduce. Perhaps it's related to the kernel
> I'm rebooting from or some complex combination.
> Pretty hard to debug when the computer simply stops responding without any
> text etc.
Yea, it's tricky sometimes. For my (I guess different) tonga start up issue I had to power down "properly" between bisect steps. By properly I mean shutdown/halt then cut the power to the PSU and wait a couple of minutes before powering up again.
Also, the way 4.9-wip changes and things get merged may not make bisecting a long way back as easy as with something like mesa. I notice it changed again 17 hours ago, so it may even work now without you being able to see any sign of a revert or fix in the history. Maybe testing other branches (after a power off, of course) would give you a better idea of where you are.
Do you think I can get both false positives and false negatives? Have tested a bit more but still behaving in a silly way...
(In reply to Ernst Sjöstrand from comment #6)
> Do you think I can get both false positives and false negatives?
> Have tested a bit more but still behaving in a silly way...
I guess your issue is not as clear cut as mine then. If you find that you can sometimes boot and sometimes not on the same kernel after power off, then that's going to be quite a pain and time consuming to find. It's not really normal to have to power off. I haven't needed to for ages; it's just that with my recent boot up issue I knew I needed to, because when I first built the kernel it worked from a reboot, but not the next day from off. I thought your issue might be similar, but apparently not.
I have never been able to reproduce the problem with a self-compiled 4.8-rc8 from that workspace, so I think my local builds are working at least. The commit "drm/amd/amdgpu: enable clockgating only after late init" sounded like it could be similar to my problem, but it didn't help.
I know what may have been fooling me. I run sudo make install on Ubuntu. It adds a .old entry in grub. However, that .old entry doesn't get a .old initrd of its own, and the initrd is where the amdgpu kernel module lives; it shares the same initrd.
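A minimal sketch of how one might spot this situation, assuming Ubuntu-style file names in /boot (vmlinuz-<version> and initrd.img-<version>, with .old appended for the previous copy). The function name is hypothetical and the naming scheme is an assumption, not something confirmed in this thread.

```python
def kernels_missing_old_initrd(boot_files):
    """Return kernel versions whose vmlinuz-<ver>.old has no initrd.img-<ver>.old.

    Assumption: files are named vmlinuz-<ver>[.old] and initrd.img-<ver>[.old].
    """
    names = set(boot_files)
    prefix, suffix = "vmlinuz-", ".old"
    missing = []
    for name in sorted(names):
        if name.startswith(prefix) and name.endswith(suffix):
            # Strip the prefix and the trailing ".old" to get the version string.
            version = name[len(prefix):-len(suffix)]
            if f"initrd.img-{version}.old" not in names:
                missing.append(version)
    return missing
```

Calling it with the output of os.listdir("/boot") would list the versions for which the .old grub entry probably boots the old kernel image with the new initrd, and therefore with the new amdgpu module.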
Created attachment 127299: Patch that fixes the issue.
Odd. We haven't seen any failures internally with Fiji and 3 VCE rings enabled. Can you provide a log of the failure? Try manually loading amdgpu after boot. E.g., append modprobe.blacklist=amdgpu to the kernel command line and boot to a non-X runlevel, then manually modprobe amdgpu and capture the dmesg output. Please attach the dmesg output from a successful boot as well. If you have remote access via ssh, that would make it easier. What VCE firmware are you using? Can you try the latest version from git and see if that helps: https://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git
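As a rough illustration of the first step above, the following sketch appends the modprobe.blacklist=amdgpu parameter to a kernel command line string without duplicating it. The helper name is hypothetical, and it only handles the simple case of a single blacklisted module (it does not merge into an existing comma-separated modprobe.blacklist list).

```python
def with_module_blacklisted(cmdline, module="amdgpu"):
    """Return cmdline with modprobe.blacklist=<module> appended if absent."""
    option = f"modprobe.blacklist={module}"
    params = cmdline.split()
    if option not in params:
        params.append(option)
    return " ".join(params)
```

With the module blacklisted at boot, it can then be loaded by hand with modprobe amdgpu and the resulting dmesg output captured, ideally over ssh so the log survives a hang.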
Couldn't get anything after modprobe, the computer really dies. However updating the firmware fixed it. Should there be some kind of check or do you consider this fixed?
I was running the Ubuntu firmware from Xenial, the firmware bundled in Wily fixed it.
(In reply to Ernst Sjöstrand from comment #12)
> Couldn't get anything after modprobe, the computer really dies.
> However updating the firmware fixed it.
>
> Should there be some kind of check or do you consider this fixed?
I'll add a check. Thanks for confirming.
Created attachment 127542: possible fix
Does this patch fix the issue?