Bug 108464 - System fails to reboot after Ctrl-Alt-Del
Summary: System fails to reboot after Ctrl-Alt-Del
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2018-10-16 21:59 UTC by Duncan Roe
Modified: 2019-08-05 08:02 UTC



Attachments
kernel config (178.46 KB, text/x-mpsub)
2018-10-16 21:59 UTC, Duncan Roe
dmesg o/p as requested (87.26 KB, text/plain)
2018-10-16 22:41 UTC, Duncan Roe
Patch for Linux-19.0 to revert e1cb3e4 (41.99 KB, patch)
2018-11-03 09:28 UTC, Duncan Roe
Diagnostic patches to determine which pointer is null (3.78 KB, patch)
2019-01-08 04:26 UTC, Duncan Roe
dmesg o/p with attachment 143007 (126.83 KB, text/plain)
2019-01-08 04:43 UTC, Duncan Roe
[PATCH] drm/amd/display: Limit number of links to num_ddc (1.14 KB, patch)
2019-01-08 14:01 UTC, Harry Wentland
dmesg o/p with diags after applying attachment 143011 (151.40 KB, text/plain)
2019-01-10 10:44 UTC, Duncan Roe
Display connectors_num & res_cap->num_ddc before compare (716 bytes, patch)
2019-02-09 00:24 UTC, Duncan Roe
dmesg o/p showing output from Attachment 143344 (93.69 KB, text/plain)
2019-02-09 01:13 UTC, Duncan Roe

Description Duncan Roe 2018-10-16 21:59:28 UTC
Created attachment 142062 [details]
kernel config

System shuts down after the reboot command or Ctrl-Alt-Del but fails to actually reboot. The last line output to the Virtual Console is "Rebooting". The screen then turns off as normal, but fails to turn on again with the boot menu.
Bisection finds this problem is introduced by commit 0a1d56599b9bb58464a8bf1243191eb32b36b694 which patches drivers/gpu/drm/amd/display/amdgpu_dm/amdgpu_dm_debugfs.c
Hardware: as documented in Bug 108139
Comment 1 Alex Deucher 2018-10-16 22:06:19 UTC
Are you sure the bisect is correct?  This commit just changes debugfs which shouldn't be triggered unless you actually write to the file in question.  Please attach your xorg log (if using X) and dmesg output.
Comment 2 Duncan Roe 2018-10-16 22:28:37 UTC
Yes, I thought that was weird (debugfs). But the adjacent commit 30cdbfaa6aa469347db7fcda5949f1ccf7559ecf does not show the problem.
Comment 3 Duncan Roe 2018-10-16 22:30:07 UTC
halt is fine btw, it's only reboot that breaks.
Do you want extra debug turned on for dmesg?
Comment 4 Duncan Roe 2018-10-16 22:41:16 UTC
Created attachment 142063 [details]
dmesg o/p as requested

No Xorg involvement - boot up to command line only
Comment 5 Duncan Roe 2018-10-18 22:50:26 UTC
Still present at 4.19.0-rc8. Is there any other info I can provide?
Comment 6 Duncan Roe 2018-10-24 22:38:00 UTC
(In reply to Duncan Roe from comment #3)
> halt is fine btw, it's only reboot that breaks.
> Do you want extra debug turned on for dmesg?

At Linux 4.19.0-rc8, power button / halt command also fails. The backlight goes off but the power stays on.
Comment 7 Alex Deucher 2018-10-25 00:06:55 UTC
(In reply to Duncan Roe from comment #6)
> (In reply to Duncan Roe from comment #3)
> > halt is fine btw, it's only reboot that breaks.
> > Do you want extra debug turned on for dmesg?
> 
> At Linux 4.19.0-rc8, power button / halt command also fails. The backlight
> goes off but the power stays on.

Can you bisect that?  Is it the same commit?
Comment 8 Duncan Roe 2018-10-25 07:40:16 UTC
(In reply to Alex Deucher from comment #1)
> Are you sure the bisect is correct?  This commit just changes debugfs which
> shouldn't be triggered unless you actually write to the file in question. 
> Please attach your xorg log (if using X) and dmesg output.

No longer sure about that. I had been bisecting in anongit.freedesktop.org/drm/drm. When I switched to git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git, both 0a1d565 and 30cdbfa show the problem. So I can now try to bisect further.
Comment 9 Duncan Roe 2018-10-25 07:46:39 UTC
(In reply to Alex Deucher from comment #7)
> (In reply to Duncan Roe from comment #6)
> > (In reply to Duncan Roe from comment #3)
> > > halt is fine btw, it's only reboot that breaks.
> > > Do you want extra debug turned on for dmesg?
> > 
> > At Linux 4.19.0-rc8, power button / halt command also fails. The backlight
> > goes off but the power stays on.
> 
> Can you bisect that?  Is it the same commit?

It is the same commit, i.e. 0a1d565 built from the stable tree consistently fails to power off. 30cdbfa *sometimes* fails to power off - I think I have seen 2 fails in 6 reboots (my spreadsheet isn't set up to count results (yet)).
Comment 10 Duncan Roe 2018-10-27 23:50:57 UTC
(In reply to Alex Deucher from comment #1)
> Are you sure the bisect is correct?  This commit just changes debugfs which
> shouldn't be triggered unless you actually write to the file in question. 
> Please attach your xorg log (if using X) and dmesg output.

A fresh bisect in the stable tree has given a new pair of commits. e1cb3e4801e6896ba93d63222b1052199d2a8c9b has the problem and 899e2aaddbfa0ff96fbaf31f0d9e91427e87dd88 does not have it. (In the stable tree, 30cdbfaa6aa469347db7fcda5949f1ccf7559ecf also has the problem, unlike in the drm tree that I bisected previously).
A diff of the new commit pair shows 14 patched files. These are TODO, Makefile and a mixture of .c and .h sources. I am unsure how to proceed with bisecting these diffs: advice welcome.
Comment 11 Duncan Roe 2018-10-29 05:24:09 UTC
There is a kernel Oops associated with this problem. I only just discovered it started on the same commit as reboot failure did. You can see the BUG line in attachment 142063 [details] at time 5.075194
Comment 12 Duncan Roe 2018-11-03 09:28:06 UTC
Created attachment 142354 [details] [review]
Patch for Linux-19.0 to revert e1cb3e4

Revert commit e1cb3e4801e6896ba93d63222b1052199d2a8c9b (drm/amd/display: Convert remaining loggers off dc_logger).
Reboot works again and the BUG / Oops is gone.
Comment 13 Duncan Roe 2019-01-08 04:26:37 UTC
Created attachment 143007 [details] [review]
Diagnostic patches to determine which pointer is null

These patches are against Linux 4.19.12, commit 2a7cb228d29c3882c1414c10a44c5f3f59bfa44d in
git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
Comment 14 Duncan Roe 2019-01-08 04:43:04 UTC
Created attachment 143008 [details]
dmesg o/p with attachment 143007 [details] [review]

The exception occurs in dc_link_aux_transfer, which is called by dm_dp_aux_transfer, the top function displayed on the stack after the BUG line in attachment 142063 [details]. With the patch there is no BUG entry; instead there is a line
Cowardly refusing to call through null pointer
after which the patch makes dc_link_aux_transfer return -1.
Code somewhere up the stack attempts 2 retries.
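
For readers without access to attachment 143007, the guard described above can be sketched roughly as follows. This is not the actual diagnostic patch: the structure layouts and the names `guarded_aux_transfer` and `fake_submit` are hypothetical stand-ins, and kernel code would log with printk-style calls rather than fprintf. It only illustrates the idea of refusing to call through a NULL aux_engine pointer and returning -1 instead.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, heavily simplified stand-ins for the driver structures;
 * the real amdgpu definitions are not reproduced here. */
struct aux_engine {
    int (*submit)(struct aux_engine *engine, const char *payload);
};

struct ddc_service {
    struct aux_engine *aux_engine; /* may be NULL, as reported in this bug */
};

/* Hypothetical engine callback used only to exercise the guard. */
int fake_submit(struct aux_engine *engine, const char *payload)
{
    (void)engine;
    return payload ? 0 : -1;
}

/* Return the engine's result, or -1 if there is nothing safe to call. */
int guarded_aux_transfer(struct ddc_service *ddc, const char *payload)
{
    if (ddc == NULL || ddc->aux_engine == NULL ||
        ddc->aux_engine->submit == NULL) {
        fprintf(stderr, "Cowardly refusing to call through null pointer\n");
        return -1;
    }
    return ddc->aux_engine->submit(ddc->aux_engine, payload);
}
```

A guard like this only hides the Oops; it does not explain why the second aux channel has no engine attached in the first place.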
Comment 15 Duncan Roe 2019-01-08 05:01:38 UTC
Further to attachment 143008 [details]: there are lots of calls to dm_dp_aux_transfer with aux=00000000f0bfdb41, but the first call with aux=0000000074cc4227 fails (because the aux_engine pointer is NULL). Then a few more calls with 00000000f0bfdb41, 2 more with 0000000074cc4227 and lastly a few with 00000000f0bfdb41 again. Does that pattern jog anyone's memory?
Is anyone else reproducing this bug?
https://bugs.freedesktop.org/show_bug.cgi?id=108139#c5 mentions the name "Stoney" (chipset(?)) in case that is any help.
If no-one else is reproducing this, what would be the most helpful thing I could try next? I don't see this behaviour in a VM, so can't gdb it.
Comment 16 Harry Wentland 2019-01-08 14:01:18 UTC
Created attachment 143011 [details] [review]
[PATCH] drm/amd/display: Limit number of links to num_ddc

Can you see if this patch helps you?
Comment 17 Duncan Roe 2019-01-10 10:44:24 UTC
Created attachment 143055 [details]
dmesg o/p with diags after applying attachment 143011 [details] [review]

The line before the BUG line shows a null pointer
Comment 18 Duncan Roe 2019-01-10 11:04:12 UTC
Mixed results on applying this patch.
IN BRIEF: If you could eliminate this second Oops then we can see what works and what doesn't.
In the meantime with the patch applied to v4.20 in the stable repository:
reboot *sometimes* works. Ctl-Alt-Del w/out logging in seems not to. Log in as root and issue reboot cmd: no. Well it did work for me a couple of times but I can't seem to be able to do it again.
Another thing: I boot to command level. If I let the VC time out (backlight goes off) then I can never wake it again no matter what keys I press. Caps Lock light goes on and off, so keyboard is still active.
Hopefully this all gets better once there is no Oops.
Attachment 143055 [details] pinpoints the immediate NULL pointer. Again this is a new aux_engine.
Comment 19 Duncan Roe 2019-02-09 00:12:49 UTC
Comment on attachment 143011 [details] [review]
[PATCH] drm/amd/display: Limit number of links to num_ddc

Review of attachment 143011 [details] [review]:
-----------------------------------------------------------------

Diagnostics show that this patch has no effect because the compared quantities are always equal
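
To illustrate why a limit of this kind is a no-op when both values are equal, here is a minimal sketch. It is based only on the patch title and the diagnostic described in this thread, not on the contents of attachment 143011; `struct resource_caps` and `limit_link_count` are hypothetical names, and the real driver code is structured differently.

```c
#include <stdio.h>

/* Hypothetical, simplified stand-in for the resource capability data. */
struct resource_caps {
    unsigned int num_ddc;
};

/* Clamp the number of links to the number of DDC lines, printing both
 * values first (in the spirit of the later diagnostic patch). */
unsigned int limit_link_count(unsigned int connectors_num,
                              const struct resource_caps *res_cap)
{
    printf("connectors_num=%u num_ddc=%u\n",
           connectors_num, res_cap->num_ddc);
    if (connectors_num > res_cap->num_ddc)
        return res_cap->num_ddc;
    return connectors_num;
}
```

When connectors_num already equals num_ddc, as the diagnostics here report, the clamp never changes the link count, which is consistent with the patch having no visible effect on this system.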
Comment 20 Duncan Roe 2019-02-09 00:24:51 UTC
Created attachment 143344 [details] [review]
Display connectors_num & res_cap->num_ddc before compare
Comment 21 Duncan Roe 2019-02-09 00:27:33 UTC
Restarting investigations at Linux 5.0.0-rc5. Modified attachment 143055 [details] to check whether the patch would trigger. It never would. New patch is attachment 143344 [details] [review].
Comment 22 Duncan Roe 2019-02-09 01:13:38 UTC
Created attachment 143346 [details]
dmesg o/p showing output from Attachment 143344 [details]

This is a typical BUG occurrence. Stack trace looks similar to attachment 143055 [details].
Would it help to add diagnostics as in 143055? Which variables would you want to see?
Comment 23 Duncan Roe 2019-02-11 23:36:08 UTC
Here are some notes for anyone trying to reproduce this problem.
By "this problem" I mean failure to reboot after "reboot" command issued.

1. On my system (Slackware, no systemd) I am triggering a reboot by Ctl-Alt-Del
   and this line in /etc/inittab:
   ca::ctrlaltdel:/sbin/shutdown -t5 -r now

2. I am only booting up to the command line (no X).

3. The occurrence of BUG is intermittent. I am seeing it on about 2 reboots in 3.

4. If there is no BUG, the next reboot will be OK.

5. If a user logs in before Ctl-Alt-Del, reboot with BUG still works maybe 90%
   of the time.

6. With BUG present, Ctl-Alt-Del at the login prompt succeeds about 50% of the
   time (measured over 25 reboots). (I put a dmesg command in rc.local to do this
   test).

With e1cb3e4801 reverted, reboot always works and BUG never shows. Were it not for that, I would be suspecting a local hardware problem by now.
Comment 24 Duncan Roe 2019-02-11 23:42:06 UTC
(In reply to Duncan Roe from comment #23)
> Here are some notes for anyone trying to reproduce this problem.
> By "this problem" I mean failure to reboot after "reboot" command issued.
> 
> 1. On my system (Slackware, no systemd) I am triggering a reboot by Ctl-Alt-Del
>    and this line in /etc/inittab:
>    ca::ctrlaltdel:/sbin/shutdown -t5 -r now
> 
> 2. I am only booting up to the command line (no X).
> 
> 3. The occurrence of BUG is intermittent. I am seeing it on about 2 reboots in 3.
> 
> 4. If there is no BUG, the next reboot will be OK.
> 
> 5. If a user logs in before Ctl-Alt-Del, reboot with BUG still works maybe 90%
>    of the time.
> 
> 6. With BUG present, Ctl-Alt-Del at the login prompt succeeds about 50% of the
>    time (measured over 25 reboots). (I put a dmesg command in rc.local to do this
>    test).
> 
> With e1cb3e4 reverted, reboot always works and BUG never shows. Were it
> not for that, I would be suspecting a local hardware problem by now
Comment 25 Duncan Roe 2019-08-05 08:02:43 UTC
Since Linux 5.1, I do not see this bug any more.
So I guess it is "fixed".