This new computer has passed a fairly stringent non-graphics hardware test: an rsync of 0.5TB of files followed by a second rsync with the --checksum option, which showed every bit was identical between the external-drive source and the internal-drive destination. Also, if I mostly bypass the RX 550 for display (except for the Linux console login prompt) by running applications on the new computer while displaying on my old computer's X server, there have been no lockups so far in this mode. These are pretty strong indications that the frequent (up to three times per day) lockups I experience when using the RX 550 for X graphical display are due to some bug in the amdgpu X driver or its Linux kernel support. I have also reported this issue at <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=900087>, which includes more detail concerning software versions, X logs, etc. for my Debian Buster versions of everything. Of course, Debian Buster is likely a bit behind the cutting edge for both the kernel and X, so to check whether the latest kernel and X already solve the lockup problems I am having with the Buster versions, I assume I will have to build both. I have general experience building Linux software, but it has been years since I built a Linux kernel and I have never built X before. So specific advice on how to build the Linux kernel and X, and especially on integrating the built results into Debian Buster without messing up the system, would be most appreciated. If the latest kernel and X do not solve this issue, I would be ready to run any additional tests you might like to help narrow down the cause of these lockups.
Please attach the corresponding full Xorg log and dmesg output. This is most likely between Mesa and the kernel; xf86-video-amdgpu doesn't contain any GPU-specific rendering code which could cause hangs. I'd recommend trying the latest upstream versions of Mesa (18.1) and the kernel, and if it still happens, also try getting the current microcode files from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu .
Created attachment 139816 [details] X log file as requested
Created attachment 139817 [details] dmesg output as requested
Hi Michel: I have added your requested attachments. If there are other data you need or other tests I can run, let me know. Meanwhile, all else seems well with this new computer (e.g., the lockups are gone under my normal KDE desktop use since I bypassed this card 3 days ago by displaying my desktop on an X server running on a different computer). But that is only a temporary workaround (another person needs to use that other computer's display and keyboard/mouse). Therefore, I need the RX 550 to work reliably on my new computer, which is why I will be following your recommendations about trying the latest kernel, Mesa, and (if all else fails) firmware. But building the kernel and Mesa is going to take me considerable time for the reasons I mentioned in my original post.
Hi Michel: Since the lockups occurred during ordinary (KDE) desktop use when I wasn't running 3D games, I ignored Mesa upgrades and instead concentrated first on trying a new kernel version (from 4.16.5 to 4.16.12, because 4.16.12 had conveniently just been propagated from Debian Sid to Buster). So far it appears that upgrade makes a large improvement for the RX 550. Previously, for 4.16.5, the uptimes before a lockup occurred ranged from 7 hours to 2 days, but right now, with heavy desktop use and a substantial number of runs of the 3D game, I haven't experienced a single lockup with 4.16.12, with current uptime approaching 3 days since I booted it. You may well conclude "problem already solved", but I normally run my computer 24/7 with reboots only when absolutely necessary. Therefore I would like to keep this bug report open for a while just to report the maximum uptimes (hopefully at least several months) I can achieve with this graphics card.
Edit of previous comment: of the 3D game, -> of the 3D game, foobillard,
Please remove the FIXED resolution of this bug. The reason for this request is that subsequent kernel 4.16.x use, after the initial success I reported, continued to show lockups whenever this graphics card was used. Yesterday I tried kernel 4.17.17-1 from Debian Buster (the first 4.17.x version I had tried) in great anticipation that these kernel lockups would be fixed, since 4.17.x apparently contains lots of AMD graphics fixes. But when I used this graphics card for ordinary direct desktop use (as opposed to accessing my desktop on the new computer via an X-terminal, which is so far the only stable way I can use it), I got a lockup within half an hour or so, followed by another roughly 8 hours later. For what it is worth, I had also installed mesa 18.1.6-1 and version 20180518-1 of the firmware-amd-graphics package from Debian Buster before performing this failing experiment. So it appears the substantial number of AMD graphics fixes in kernel 4.17.x and mesa 18.1.y, plus installation of the relatively recent (May) Debian Buster firmware-amd-graphics package, are not sufficient to stabilize this AMD RX 550 graphics card. That is a big disappointment, since this card should no longer be considered cutting-edge hardware (it was first offered for sale at least 16 months ago), and the delay in fixing it cannot be attributed to non-cooperation from AMD, since they appear to have a good open-source record. Because of these ongoing issues with direct use of this card, I am going back to using the X-terminal method with this kernel, which experience with kernel 4.16.x shows is much more stable since it avoids using this graphics card completely (except for the direct display of the Linux console login prompt). I plan to try using this card directly again when kernel 4.18.x is promoted to Buster. Meanwhile, if you have any other suggestions I could try, please let me know.
Created attachment 141451 [details] tarball containing kern.log, syslog, and dmesg output
We (there are two of us using this machine) just got yet another kernel lockup (no remote access possible with ssh, direct keyboard not working), but this is a case where we were remotely accessing this box with an X-terminal. In other words, the only use of the RX 550 was to display the command-line login prompt on the Linux console of the directly attached monitor, until the lockup, when it displayed the following message (roughly 15 times in the half hour before I got out of the lockup by pushing the reset button):

watchdog: BUG: soft lockup - CPU#12 stuck for 22s! [firefox-esr:29266]

(At the time we were both browsing different sites with Firefox, with one of those Firefox instances running for a couple of days; as a security measure we both restrict the use of JavaScript with the NoScript extension.) I have attached a tarball containing log files (kern.log and syslog) that contain the lockup information (including the above message) as well as information about the fresh boot afterwards. (For what it is worth, that tarball also includes dmesg output, which appears to contain information only about the fresh boot.) For this minimal use case for the RX 550, the Linux kernel lasted 6 days before the lockup, which is much better than the direct use case, where lockups can occur as soon as half an hour after a fresh boot. So the current lockup could be due to an entirely different bug than the lockups I have encountered in the direct use case. But, of course, minimal use is not zero use, so currently I ascribe both the present remote-use lockup and the previous direct-use lockups to some incompatibility between the RX 550 and the Debian Testing graphics stack.
That stack currently includes the following component versions:

linux-image-4.17.0-3-amd64 4.17.17-1
firmware-amd-graphics 20180518-1
libdrm-amdgpu1:amd64 2.4.93-1
libglapi-mesa:amd64 18.1.6-1
xserver-xorg-video-amdgpu 18.0.1-1+b1

Please let me know if there are any other data you need or any experiments you would like me to try. In any case I plan to continue with remote use of this box, reporting lockup incidents as they occur. But I also plan to try direct use again whenever one of the components of the above stack is significantly upgraded in Debian Testing.
(In reply to Alan W. Irwin from comment #9)
> So the current lockup could be due to an entirely different bug than in the
> lockups I have encountered for the direct use case.

Yeah, that looks like an RCU or other core kernel issue, not directly related to the graphics drivers (which, as you say, aren't really being used in this case). Does idle=nomwait on the kernel command line help with any of these issues, by any chance? It's also worth making sure the motherboard BIOS is up to date.
Thanks for that idle=nomwait suggestion, which I have now tried, verified by:

irwin@merlin> cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.17.0-3-amd64 root=UUID=1e45a1ee-a5d6-4327-9a7b-2663ffc0b157 ro rootwait quiet idle=nomwait

and I now indeed have a stable result. However, that is currently just for the last 5 minutes in remote access mode. :-) So we will see how this goes for, say, the next two weeks, to see if I can beat my previous 4.17.17 remote-access uptime record of 6 days. With regard to your motherboard BIOS update suggestion, I am going to hold back on that for a while, since the techs from the local computer company that assembled my box in May felt such updates were dangerous and therefore a last resort. That is also the consistent advice I have gotten for the other 3 Linux boxes I have had assembled for me since I started using Linux in 1996. Of course, this year may be a special case with Meltdown (although not for this AMD hardware) and the many variants of Spectre out there, so I do plan to update the BIOS within the next couple of months, on the assumption that the Spectre BIOS mitigations recommended by AMD to ASUS for this hardware (PRIME B350-PLUS motherboard with AMD Ryzen 7 1700 CPU, 64GB RAM, and ASUS RX 550 graphics card) will have matured by then. But before I implement that planned BIOS update, I am hoping that the current cutting-edge Linux graphics stack (which according to a senior Phoronix poster works well for the RX 560) will also give me stable direct-display results for the RX 550 once that version of the graphics stack propagates to Debian Testing. I estimate that propagation will take a couple more months, based on how quickly elements of the cutting-edge Linux graphics stack such as the kernel have propagated from upstream to Debian Testing in the past.
In sum, it is a waiting game now to see if your idle=nomwait suggestion restores the complete Linux stability I was used to with my old box (on Debian Oldstable = Jessie), at least for the remote display case. If that stability is obviously much better (i.e., at least a couple of weeks of uptime with no lockups), then I will try the direct display case again with idle=nomwait to see if it makes that case stable as well. Thanks, Michel, for your ongoing helpful suggestions for dealing with this troubling instability issue (these troubling instability issues?) on my new Linux box. Alan
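For anyone else following along, here is a sketch of how I understand the parameter is made persistent on a Debian system that boots with GRUB (file name and variable assume Debian defaults; adjust to your setup):

```shell
# Fragment of /etc/default/grub (Debian default location, assuming GRUB):
GRUB_CMDLINE_LINUX_DEFAULT="quiet idle=nomwait"

# Then regenerate the GRUB configuration and reboot:
#   sudo update-grub && sudo reboot
# Verify after the reboot:
#   cat /proc/cmdline    # should now include idle=nomwait
```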
(In reply to Michel Dänzer from comment #10)
> [T]hat looks like an RCU or other core kernel issue, not directly
> related to the graphics drivers.

Hi Michel: If so, should I report that probable non-graphics kernel bug (with my crash-report tarball) elsewhere? Or do you suggest I just forget it until I see the remote graphics results of idle=nomwait over the next couple of weeks AND (if that is a success) the direct graphics results of idle=nomwait for a couple more weeks after that?
Well, after 1.5 (successful) days with the remote graphics experiment, I decided instead that it made more sense to go after the quicker-acting instability that I have previously experienced in direct graphics mode. So just now I have started a direct graphics experiment after a Debian Testing upgrade which included the following firmware and Mesa changes:

firmware-amd-graphics updated "(20180825+dfsg-1) over (20180518-1)"
mesa updated "(18.1.7-1) over (18.1.6-1)"

In addition, for this experiment I installed the amd64-microcode package that contains "microcode patches for all AMD AMD64 processors". Also, as part of this experiment I have continued with the idle=nomwait kernel parameter, as verified by:

irwin@merlin> cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.17.0-3-amd64 root=UUID=1e45a1ee-a5d6-4327-9a7b-2663ffc0b157 ro rootwait quiet idle=nomwait

N.B. those kernel parameters do not include any amdgpu-related parameters. Do you recommend any such parameters for the RX 550, such as amdgpu.dc=1, which is sometimes recommended for older versions of AMD new-generation graphics hardware?
Created attachment 141479 [details] compressed dmesg output from current direct graphics experiment
Created attachment 141567 [details] log files from latest lockup
I was beginning to have some hope that the latest direct access experiment would prove to be stable. However, just now it locked up again after almost 7 days. So the stability is substantially improved compared to before, and my guess is that the improvement is due to installation of the amd64-microcode package from Debian Buster for this latest experiment. However, this is still disappointing stability, because on truly stable systems I typically achieve uptimes of 30 days or longer, with the only limit on uptime being how often I have to reboot for kernel upgrades. I have attached a crash-report tarball containing dmesg output as well as various log files that captured all log activity before the lockup and the boot afterward. I don't see anything concerning the crash in those log files, but I may be missing something since I am no expert, so I would appreciate it if you took a look. I have restarted exactly the same direct graphics access test (with the same versions of the graphics stack packages and your recommended idle=nomwait kernel parameter) in hopes that the kernel will last longer this time before a lockup and/or I catch more details of the lockup when it occurs. If you would prefer me to try a different variant of this test, please let me know.
I terminated the last test immediately because it turns out a new kernel (Linux merlin 4.18.0-1-amd64 #1 SMP Debian 4.18.6-1 (2018-09-06) x86_64 GNU/Linux) has propagated from Debian Unstable to Debian Testing = Buster, so I will use that kernel for my new test. On boot with this new kernel, the usual blast of random color on the Linux console displayed by the RX 550, which I was used to with all previous kernel versions, is now gone. That is a positive step in the right direction, and I hope it means the Debian Buster graphics stack is finally completely stable for the RX 550; I will test that hypothesis with this latest test. The latest Debian Buster graphics stack versions for this direct graphics kernel stability test for the RX 550 are as follows:

linux-image-4.18.0-1-amd64 4.18.6-1
amd64-microcode 3.20180524.1
firmware-amd-graphics 20180825+dfsg-1
libdrm-amdgpu1:amd64 2.4.94-1
libglapi-mesa:amd64 18.1.7-1
xserver-xorg-video-amdgpu 18.0.1-1+b1

Here are my kernel parameters, which include the suggested idle=nomwait:

irwin@merlin> cat /proc/cmdline
BOOT_IMAGE=/boot/vmlinuz-4.18.0-1-amd64 root=UUID=1e45a1ee-a5d6-4327-9a7b-2663ffc0b157 ro rootwait quiet idle=nomwait
Created attachment 141706 [details] tarball containing log information concerning latest lockup
Despite a new kernel, this instability issue has continued. Kernel 4.18.6 locked up after 8+ days of uptime on our principal computer that has the RX 550 graphics card installed. (I will refer to this computer as the "new" computer, our other working Linux computer that is used to display X results from the new computer as the X-terminal, and our old principal computer (now powered down permanently) as the "old" computer.) The lockup of the new computer occurred some time in the early morning, with (since two users use this machine at one time) one inactive XFCE desktop being displayed on our X-terminal and one inactive XFCE desktop being displayed directly on the new computer. The only symptom of the lockup I could spot in the log files was a burst of null bytes in each log file; for what it is worth, that symptom is new. See the attached crash_report_20180923.tar.gz for the log file and dmesg details.

This result of 8+ days of uptime for direct graphics desktop use of the new computer is slightly better than the almost 7 days of uptime achieved in the previous similar test with kernel 4.17.17. Although the present uptime result at least encourages further testing with kernel 4.18.x, this is only one test, and the next test might give a substantially shorter or longer uptime. In any case this result is still far from ideal, since such lockups never occurred on the old computer that this new computer replaced, and also do not currently occur on the X-terminal. That is, on the old principal box uptimes exceeding 30 days were common, and similarly on the X-terminal, with the only reasons for rebooting in those cases being power interruptions or the installation of a new kernel. For the present case of the new box, the lockups mean the only recovery possible is to hit the reset button, with all that implies about journal recovery and potential deletion of files left in an inconsistent state by the lockup.
For what it is worth, the lockup symptoms this time were a bit different than before. The new computer had a frozen display (rather than blank, as before), and a frozen mouse and keyboard (as before). The X-terminal used to remotely access a desktop running on the new computer had a frozen display (rather than blanked) with a working keyboard (and maybe mouse, but I didn't record that), so I could exit the local X and get to the Linux console, where ping to the new computer actually worked (as opposed to ping not working at all for the previous lockup). Because networking was working, ssh to the new computer didn't time out; however, it ran for 20+ minutes with no sign of a login, so the net result was the same as for previous lockups: there was no way to log in to the new computer from another machine to shut it down normally, so the only method of shutting it down was to hit the reset button.
I started a new stability test as of 2018-09-23 15:34:19, right after a Debian Buster dist-upgrade. The graphics stack versions for this test are as follows:

ii amd64-microcode 3.20180524.1 amd64 Processor microcode firmware for AMD CPUs
ii firmware-amd-graphics 20180825+dfsg-1 all Binary firmware for AMD/ATI graphics chips
ii libdrm-amdgpu1:amd64 2.4.94-1 amd64 Userspace interface to amdgpu-specific kernel DRM services -- runtime
ii libglapi-mesa:amd64 18.1.7-1 amd64 free implementation of the GL API -- shared library
ii linux-image-4.18.0-1-amd64 4.18.6-1 amd64 Linux 4.18 for 64-bit PCs
ii xserver-xorg-video-amdgpu 18.1.0-1 amd64 X.Org X server -- AMDGPU display driver

That is, these versions are identical to the previous test other than the (substantial) update of the AMDGPU display driver from version 18.0.1-1+b1 to version 18.1.0-1. The kernel parameters were the same as in the previous test, i.e.,

BOOT_IMAGE=/boot/vmlinuz-4.18.0-1-amd64 root=UUID=1e45a1ee-a5d6-4327-9a7b-2663ffc0b157 ro rootwait quiet idle=nomwait
Created attachment 141724 [details] tarball containing kern.log, syslog, and dmesg output
This last stability test lasted only 17.5 hours before the lockup. See the latest attached tarball for the relevant log files (which capture everything during this short uptime) and dmesg output. As far as I can tell there is nothing in those log files relevant to the lockup, e.g., no burst of null ASCII characters like what occurred in the log files for the previous experiment. There are some segfaults associated with a cron task I have configured every morning starting at 4:32, but those always occur for that task (which is a complete build and test of CMake), so I don't think they are relevant. The actual lockup today happened with one inactive desktop running on the X-terminal and one active desktop running on the new box. (I was editing a file with Emacs.) Also, the symptoms of this lockup were more severe, i.e., ping did not work from the X-terminal to the new box. But as always there was no way to shut down the new box properly, so I had to do that with the reset button.

Since I bought the new box in May, remote access from an X-terminal has locked up only twice (one of those detailed here), and only after a relatively long period of time. So tests where X-terminal use is the only way to access the new box seem in general much more stable than direct use (as in the present case, with such a short time before the lockup). And I haven't tried sole use of the X-terminal for a while now, so that may be completely stable with the new kernel. My conclusion therefore remains that the problem is associated with the Debian Buster graphics stack (and likely also the very latest graphics stack, if someone will do some uptime tests of modern AMD graphics cards with that stack) used to display and control the RX 550 card on the new box. I have now started a new test (as of 9:08:19 today) with all graphics stack versions and kernel parameters the same as the previous test, in hopes that when the inevitable lockup comes the log files will be more informative.
Please let me know if you have some other experiment you would like me to try.
Created attachment 141872 [details] tarball containing daemon.log, messages, kern.log, syslog, and dmesg output

The previously described uptime test lasted 9+ days (until the lockup this morning), but the log files included nothing that seemed relevant. The next uptime test, started this morning with exactly the same graphics stack and kernel parameters, lasted only 7 hours until a lockup, and this time the (attached) log files caught substantial error messages before the crash. @Michel Dänzer: Could you please take a look at this one to see whether there is some clue in the kernel error messages concerning the source of this instability?
Created attachment 141904 [details] X log file showing segfault

Just now the direct X server failed with a segfault (see the attached log file for the details). I restarted direct X with my normal startx method, and the kernel stability test I started yesterday, with two different desktops running (one with direct X and one for a different user who uses an X-terminal), is continuing. I also reviewed the previous tarball that contained the log files for the last kernel lockup, and the messages there have a lot to say about NMIs. So I am hoping that if some expert here takes a look at those log files and/or the attached X log file containing the X server segfault messages, they might be able to find a way to increase stability for the RX 550, or might recommend some variation on these stability experiments to get a better idea of the graphics stack bug(s) causing this issue (these issues).
(In reply to Alan W. Irwin from comment #24)
> Just now the direct X server failed with a segfault (see the attached log
> file for the details).

Looks like a Mesa bug. Please install the libgl1-mesa-dri-dbgsym package and attach another log file if it happens again.
(In reply to Michel Dänzer from comment #25)
> (In reply to Alan W. Irwin from comment #24)
> > Just now the direct X server failed with a segfault (see the attached log
> > file for the details).
>
> Looks like a Mesa bug. Please install the libgl1-mesa-dri-dbgsym package and
> attach another log file if it happens again.

Good idea, but I cannot follow up on it. Debian Jessie = oldstable had debug packages for libgl1-mesa-dri, but I can find nothing equivalent for Debian Stretch (not relevant to my Debian Buster box, but I looked for it nevertheless) or Debian Buster. Debian Sid has such packages, but they are all for non-official hardware platforms, not for my AMD64 hardware platform. Is there any further follow-up you can recommend for the NMI-related error messages from the latest kernel lockup?
(In reply to Alan W. Irwin from comment #26)
> Debian Jessie = oldstable had debug packages for libgl1-mesa-dri, but I
> can find nothing equivalent for Debian Stretch

Debugging symbol packages are in a separate repository now; add this to /etc/apt/sources.list:

deb https://deb.debian.org/debian-debug/ <suite>-debug main contrib non-free

(Replace <suite> with the suite name you have for the main repository there.)

> Is there any further follow up you can recommend for the NMI-related error
> messages for the latest kernel lockup?

Looks e1000e network driver related.
(In reply to Michel Dänzer from comment #27)
> Debugging symbol packages are in a separate repository now, add this to
> /etc/apt/sources.list:
>
> deb https://deb.debian.org/debian-debug/ <suite>-debug main contrib non-free

Thanks for that Debian Buster help concerning debug symbols. As a result I now have libgl1-mesa-dri-dbgsym installed, just in case I run into this segfault again.

> Looks e1000e network driver related.

So this appears to be a side issue, separate from the much more frequent lockups I tend to get whenever I am using the RX 550 card. It is therefore off-topic for the current bug report, but thanks for helping me determine that by classifying this particular source of kernel 4.18.x lockups on my now 5-month-old and still not stable Linux box. Anyhow, I will continue the present stability experiment to see how far I get before the next lockup.
Created attachment 142075 [details] tarball containing daemon.log, messages, kern.log, syslog, and dmesg output
This time the system lasted almost 14 days before the lockup. See the latest attachment for the log details, which contain NMI messages followed by a burst of ASCII null characters (which in my experience can be due to different threads or processes trying to write to the same file, i.e., the NMI error messages themselves might have exposed another kernel bug). Unlike the last case of NMI messages, where an Intel network card was mentioned, the only hardware I can see mentioned in these messages is a particular CPU and my motherboard, e.g.,

Oct 17 13:25:02 merlin kernel: [1177237.021995] NMI watchdog: Watchdog detected hard LOCKUP on cpu 13
[...]
Oct 17 13:25:02 merlin kernel: [1177237.022042] Hardware name: System manufacturer System Product Name/PRIME B350-PLUS, BIOS 3803 01/22/2018

So this is not hard evidence of a graphics stack bug, since likely any Linux system component bug could lock up a CPU, but I am still pretty sure this is a graphics stack issue with the RX 550, because of my prior evidence showing much better kernel stability when I do not use the RX 550 card at all. I started a new uptime experiment using today's snapshot of Debian Buster, which left most of the graphics stack the same other than libdrm-amdgpu1 (updated from 2.4.94-1 to 2.4.95-1) and the Linux kernel (updated from 4.18.6-1 to 4.18.10-2).
Created attachment 142114 [details] tarball containing log information concerning latest lockup
I had another lockup today after ~3 days of uptime. Please see the most recent attachment for the relevant log files and dmesg output corresponding to this lockup. These logs contain NMI messages and references to the e1000e kernel module, so, although I am no expert, this lockup appears to be e1000e related. That kernel module drives the following Intel networking expansion card:

09:00.0 Ethernet controller: Intel Corporation 82574L Gigabit Network Connection

That card is used to connect (via crossover cable) this Ryzen 7 box to an external X-terminal box. Since this is the second recent lockup attributed to the e1000e module, and other recent lockups I have encountered with no error messages (or only ASCII null bytes) in the system logs could also be due to this kernel module, for the current stability test I just started I have turned off the X-terminal completely (to eliminate use of the 82574L other than its initial detection). The second user of the present system, who previously accessed it via the X-terminal, is now accessing it locally with "startx -- :1" while I am accessing it locally with "startx". So the two users are sharing one monitor, keyboard, and mouse, switching between their two XFCE desktops and associated local X servers with the appropriate ctrl-alt-Fn keyboard shortcuts. Although this is a painful way to run our two desktops, it is obviously a more stringent test, and also a much cleaner test (without the e1000e troubles confusing graphics stack issues) of how stable the Debian Buster graphics stack is for my Ryzen 7 1700 system with 64GB RAM, an (idle) Intel 82574L networking card, and an (extremely busy, since it is switched between two X servers several times per day) AMD RX 550.
In sum, my hope is that all the other package upgrades and installations (e.g., of the firmware packages) I have done have completely stabilized the Debian Buster graphics stack for my Ryzen 7 system with the RX 550, so that I will get an uptime (with this painful but useful two-local-X-servers experiment) of at least several months, which would allow me to close this bug report as fixed. @Michel Dänzer: Meanwhile, where is the best upstream place to report the repeated lockups with the e1000e? I have already created a Debian Buster bug report at <https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=911496> concerning the e1000e lockups, but I would like to repeat it on the relevant kernel bug tracker in case there is no Debian response.
Created attachment 142358 [details] tarball containing log information concerning latest lockup
I have discovered this box became significantly less stable when there were two users displaying X directly on it (one with startx, one with startx -- :1), using ctrl-alt-F1 and ctrl-alt-F2 to switch between the two local X servers that were displaying two different XFCE desktops via the RX 550 graphics card. After moving to that mode of operation we got the following uptimes before lockups:

1 day, 2 times
2 days, 1 time
3 days, 1 time

For each of these 4 lockups I could not spot any relevant messages in the log files. But the substantially shorter uptimes for this method of using the box do appear to confirm there are still issues with the graphics stack for the RX 550. Yet the graphics content being displayed by the two users is roughly similar, so I don't understand why this mode of operation is so much less stable than one user using the RX 550 while the other uses an X-terminal. (None of these lockups occurred anywhere near the times we switched between the two local X servers, but I suppose it is possible that switching sets up a condition that results in a lockup much later.) Anyhow, because of the increased instability I gave up on the two-local-X-servers approach and went back to the one-local-X-server-plus-X-terminal approach, and with that approach we got an uptime of a week before the system locked up. That lockup occurred tonight, and I have attached a tarball containing log files that show many NMI error messages associated with it (but with no reference to the e1000e module this time). @Michel Dänzer: Could you please take a look at these log files and let me know if this is the best place to report the present lockup?
To prevent random kernel lockups with Ryzen, enable RCU_NOCB_CPU in the kernel configuration and boot the kernel with the rcu_nocbs=0-X command line parameter, where X is the CPU thread count minus 1. To fix this in the BIOS instead, select Typical Current Idle in the BIOS Advanced/AMD CBS menu.
(In reply to fin4478 from comment #35)
> To prevent random kernel lock ups with Ryzen, enable RCU_NOCB_CPU in the
> kernel configuration and boot the kernel with the rcu_nocbs=0-X command
> line parameter. X is the cpu thread count -1. To fix this with bios, set to
> Typical Current Idle in the bios Advanced/AMD CBS menu.

I was quickly able to verify all you said at <https://community.amd.com/thread/225795> and <https://bugzilla.kernel.org/show_bug.cgi?id=196683>. So it appears all Linux Ryzen owners should be aware of this "idle" issue and take the necessary workarounds, but despite many months of publicizing my Linux Ryzen troubles in a number of different Linux forums (including this bug report) and many different Google searches, I remained clueless about this bad Linux Ryzen situation until now. Many thanks for being the first to clue me in! It took me a while to figure out how to rebuild the latest Debian Buster kernel (4.18.10) with RCU_NOCB_CPU enabled, but I have done that now and have just rebooted with that custom kernel using the rcu_nocbs=0-15 kernel parameter (my Ryzen 7 1700 has 8 cores and 16 threads). So my hopes are high that this step will clean up the lockups I have been experiencing when my system was idling at night. But I have also experienced lockups while the system was in use, so rcu_nocbs=0-15 may not be the sole step I have to take to stabilize my Linux Ryzen system.
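For anyone else hitting this, a rough sketch of the steps involved (the rebuild commands are assumptions based on the standard upstream/Debian kernel build workflow; adjust to your own setup):

```shell
# Compute the rcu_nocbs value from the thread count (a Ryzen 7 1700 has
# 8 cores / 16 threads, so the range is 0-15).
THREADS=16                            # on the target box: THREADS=$(nproc)
echo "rcu_nocbs=0-$((THREADS - 1))"   # prints: rcu_nocbs=0-15

# In an unpacked kernel source tree, the option can be enabled with the
# upstream helper script before rebuilding Debian packages:
#   scripts/config --enable RCU_NOCB_CPU
#   make -j"$(nproc)" bindeb-pkg
# Finally, add rcu_nocbs=0-15 to GRUB_CMDLINE_LINUX_DEFAULT in
# /etc/default/grub and run update-grub.
```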
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1314.