Created attachment 142062 [details] kernel config System shuts down after reboot command or Ctrl-Alt-Del but fails to actually reboot. The last line output to the Virtual Console is "Rebooting". The screen then turns off as normal, but fails to turn on again with the boot menu. Bisection finds this problem is introduced by commit 0a1d56599b9bb58464a8bf1243191eb32b36b694 which patches drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_debugfs.c Hardware: as documented in Bug 108139
Are you sure the bisect is correct? This commit just changes debugfs which shouldn't be triggered unless you actually write to the file in question. Please attach your xorg log (if using X) and dmesg output.
Yes, I thought that was weird (debugfs). But the adjacent commit 30cdbfaa6aa469347db7fcda5949f1ccf7559ecf does not show the problem.
halt is fine btw, it's only reboot that breaks. Do you want extra debug turned on for dmesg?
Created attachment 142063 [details] dmesg o/p as requested No Xorg involvement - boot up to command line only
Still present at 4.19.0-rc8. Is there any other info I can provide?
(In reply to Duncan Roe from comment #3) > halt is fine btw, it's only reboot that breaks. > Do you want extra debug turned on for dmesg? At Linux 4.19.0-rc8, power button / halt command also fails. The backlight goes off but the power stays on.
(In reply to Duncan Roe from comment #6) > (In reply to Duncan Roe from comment #3) > > halt is fine btw, it's only reboot that breaks. > > Do you want extra debug turned on for dmesg? > > At Linux 19.0-rc8, power button / halt command also fails. The backlight > goes off but the power stays on. Can you bisect that? Is it the same commit?
(In reply to Alex Deucher from comment #1) > Are you sure the bisect is correct? This commit just changes debugfs which > shouldn't be triggered unless you actually write to the file in question. > Please attach your xorg log (is using X) and dmesg output. No longer sure about that. I had been bisecting in anongit.freedesktop.org/drm/drm. When I switched to git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git, both 0a1d565 and 30cdbfa show the problem. So I can now try to bisect further.
(In reply to Alex Deucher from comment #7) > (In reply to Duncan Roe from comment #6) > > (In reply to Duncan Roe from comment #3) > > > halt is fine btw, it's only reboot that breaks. > > > Do you want extra debug turned on for dmesg? > > > > At Linux 19.0-rc8, power button / halt command also fails. The backlight > > goes off but the power stays on. > > Can you bisect that? Is it the same commit? It is the same commit, i.e. 0a1d565 built from the stable tree consistently fails to power off. 30cdbfa *sometimes* fails to power off: I think I have seen 2 failures in 6 reboots (my spreadsheet isn't set up to count results yet).
(In reply to Alex Deucher from comment #1) > Are you sure the bisect is correct? This commit just changes debugfs which > shouldn't be triggered unless you actually write to the file in question. > Please attach your xorg log (is using X) and dmesg output. A fresh bisect in the stable tree has given a new pair of commits. e1cb3e4801e6896ba93d63222b1052199d2a8c9b has the problem and 899e2aaddbfa0ff96fbaf31f0d9e91427e87dd88 does not have it. (In the stable tree, 30cdbfaa6aa469347db7fcda5949f1ccf7559ecf also has the problem, unlike in the drm tree that I bisected previously). A diff of the new commit pair shows 14 patched files. These are TODO, Makefile and a mixture of .c and .h sources. I am unsure how to proceed with bisecting these diffs: advice welcome.
There is a kernel Oops associated with this problem. I only just discovered it started on the same commit as reboot failure did. You can see the BUG line in attachment 142063 [details] at time 5.075194
Created attachment 142354 [details] [review] Patch for Linux-19.0 to revert e1cb3e4 Revert commit e1cb3e4801e6896ba93d63222b1052199d2a8c9b (drm/amd/display: Convert remaining loggers off dc_logger). Reboot works again and the BUG / Oops is gone.
Created attachment 143007 [details] [review] Diagnostic patches to determine which pointer is null These patches are against Linux 4.19.12, commit 2a7cb228d29c3882c1414c10a44c5f3f59bfa44d in git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Created attachment 143008 [details] dmesg o/p with attachment 143007 [details] [review] The exception occurs in dc_link_aux_transfer, which is called by dm_dp_aux_transfer which is the top displayed function on the stack after the BUG line in attachment 142063 [details]. There is no BUG entry with the patch, instead there is a line Cowardly refusing to call through null pointer after which the patch makes dc_link_aux_transfer return -1. Code somewhere up the stack attempts 2 retries.
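For readers without the attachment to hand, the guard described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the actual diagnostic patch: the struct and function names (link_sketch, aux_engine_sketch, aux_transfer_sketch, aux_transfer_with_retries, always_ok) are hypothetical stand-ins for the amdgpu display types, and the retry loop is only an assumption about how the caller further up the stack behaves.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the driver structures; not the real amdgpu types. */
struct aux_engine_sketch {
    int (*submit)(struct aux_engine_sketch *engine);
};

struct link_sketch {
    struct aux_engine_sketch *aux_engine; /* may be NULL, as seen in the dmesg output */
};

/* Guarded transfer: refuse to call through a NULL pointer and return -1 instead. */
static int aux_transfer_sketch(struct link_sketch *link)
{
    if (link == NULL || link->aux_engine == NULL ||
        link->aux_engine->submit == NULL) {
        fprintf(stderr, "Cowardly refusing to call through null pointer\n");
        return -1;
    }
    return link->aux_engine->submit(link->aux_engine);
}

/*
 * Assumed caller behaviour: one initial attempt plus up to max_retries
 * retries. The number of attempts made is reported through *attempts.
 */
static int aux_transfer_with_retries(struct link_sketch *link, int max_retries,
                                     int *attempts)
{
    int ret = -1;

    assert(max_retries >= 0);
    *attempts = 0;
    for (int i = 0; i <= max_retries; i++) {
        (*attempts)++;
        ret = aux_transfer_sketch(link);
        if (ret >= 0)
            break; /* success: stop retrying */
    }
    return ret;
}

/* Trivial engine callback used to exercise the non-NULL path. */
static int always_ok(struct aux_engine_sketch *engine)
{
    (void)engine;
    return 0;
}
```

Calling aux_transfer_with_retries with a link whose aux_engine is NULL and max_retries set to 2 makes three attempts and returns -1 each time, which is consistent with the "2 retries" seen in the log. With an engine whose callback is always_ok, the first attempt succeeds.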
Further to attachment 143008 [details]: there are lots of calls to dm_dp_aux_transfer with aux=00000000f0bfdb41, but the first call with aux=0000000074cc4227 fails (because the aux_engine pointer is NULL). Then a few more calls with 00000000f0bfdb41, 2 more with 0000000074cc4227 and lastly a few with 00000000f0bfdb41 again. Does that pattern jog anyone's memory? Is anyone else reproducing this bug? https://bugs.freedesktop.org/show_bug.cgi?id=108139#c5 mentions the name "Stoney" (possibly the chipset) in case that is any help. If no-one else is reproducing this, what would be the most helpful thing I could try next? I don't see this behaviour in a VM, so can't gdb it.
Created attachment 143011 [details] [review] [PATCH] drm/amd/display: Limit number of links to num_ddc Can you see if this patch helps you?
Created attachment 143055 [details] dmesg o/p with diags after applying attachment 143011 [details] [review] The line before the BUG line shows a null pointer
Mixed results on applying this patch. IN BRIEF: If you could eliminate this second Oops then we can see what works and what doesn't. In the meantime with the patch applied to v4.20 in the stable repository: reboot *sometimes* works. Ctl-Alt-Del w/out logging in seems not to. Log in as root and issue reboot cmd: no. Well it did work for me a couple of times but I can't seem to be able to do it again. Another thing: I boot to command level. If I let the VC time out (backlight goes off) then I can never wake it again no matter what keys I press. Caps Lock light goes on and off, so keyboard is still active. Hopefully this all gets better once there is no Oops. Attachment 143055 [details] pinpoints the immediate NULL pointer. Again this is a new aux_engine.
Comment on attachment 143011 [details] [review] [PATCH] drm/amd/display: Limit number of links to num_ddc Review of attachment 143011 [details] [review]: ----------------------------------------------------------------- Diagnostics show that this patch has no effect because the compared quantities are always equal
Created attachment 143344 [details] [review] Display connectors_num & res_cap->num_ddc before compare
Restarting investigations at Linux 5.0.0-rc5. Modified attachment 143055 [details] to check whether the patch would trigger. It never would. New patch is attachment 143344 [details] [review].
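The check being instrumented here can be sketched as below. This is an illustration under stated assumptions, not the content of either attachment: it assumes the "Limit number of links to num_ddc" patch clamps connectors_num to res_cap->num_ddc, and res_cap_sketch and limit_links_sketch are hypothetical names. It shows why printing both values before the compare is useful: when they are always equal, the clamp branch never runs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for the resource capability structure. */
struct res_cap_sketch {
    int num_ddc;
};

/*
 * Print both quantities before comparing them, then clamp the connector
 * count to num_ddc. Returns the number of links that would be created.
 */
static int limit_links_sketch(int connectors_num,
                              const struct res_cap_sketch *res_cap)
{
    assert(res_cap != NULL);

    fprintf(stderr, "connectors_num=%d num_ddc=%d\n",
            connectors_num, res_cap->num_ddc);

    if (connectors_num > res_cap->num_ddc) {
        fprintf(stderr, "limiting links to num_ddc\n");
        return res_cap->num_ddc;
    }
    /* Equal or smaller: the clamp is a no-op, as observed on this hardware. */
    return connectors_num;
}
```

With equal inputs (for example 3 connectors and num_ddc of 3) the function returns the connector count unchanged and the "limiting" message is never printed, which matches the observation that the patch would never trigger.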
Created attachment 143346 [details] dmesg o/p showing output from Attachment 143344 [details] This is a typical BUG occurrence. Stack trace looks similar to attachment 143055 [details]. Would it help to add diagnostics as in 143055? Which variables would you want to see?
Here are some notes for anyone trying to reproduce this problem. By "this problem" I mean failure to reboot after "reboot" command issued.
1. On my system (Slackware, no systemd) I am triggering a reboot by Ctl-Alt-Del and this line in /etc/inittab:
   ca::ctrlaltdel:/sbin/shutdown -t5 -r now
2. I am only booting up to the command line (no X).
3. The occurrence of BUG is intermittent. I am seeing it on about 2 reboots in 3.
4. If there is no BUG, the next reboot will be OK.
5. If a user logs in before Ctl-Alt-Del, reboot with BUG still works maybe 90% of the time.
6. With BUG present, Ctl-Alt-Del at the login prompt succeeds about 50% of the time (measured over 25 reboots). (I put a dmesg command in rc.local to do this test).
With e1cb3e4801 reverted, reboot always works and BUG never shows. Were it not for that, I would be suspecting a local hardware problem by now.
Since Linux 5.1, I do not see this bug any more. So I guess it is "fixed".