Bug 107432 - Periodic complete system lockup with Vega M and Kernel 4.18-rc6+
Status: RESOLVED NOTABUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2018-07-31 01:10 UTC by Robert Strube
Modified: 2018-10-23 23:24 UTC

Attachments
System log leading up to hard crash (10.47 KB, text/x-log)
2018-07-31 01:10 UTC, Robert Strube
dmesg log leading up to system crash (more detailed) (8.30 KB, text/plain)
2018-07-31 01:31 UTC, Robert Strube
Use kvmalloc in amdgpu_uvd_suspend (1.24 KB, patch)
2018-07-31 09:09 UTC, Michel Dänzer
dmesg log leading up to out of memory scenario (no crash this time) (31.83 KB, text/plain)
2018-08-02 05:59 UTC, Robert Strube

Description Robert Strube 2018-07-31 01:10:56 UTC
Created attachment 140902
System log leading up to hard crash

Description:

Periodically my system will begin to slow down dramatically (the mouse cursor hitches as I try to move it) and I am unable to interact with anything on the screen.  Eventually the mouse cursor disappears altogether.  When I switch to a tty I am prompted to log in, but after entering my credentials nothing happens.  It appears to be a hard lockup.  The only solution is to manually power down my machine and reboot.

This probably happens one or two times a day, normally after starting a new application.

Hardware:
Dell XPS 15 9575 2 in 1 (Kaby Lake G)

Versions:
Kernel 4.18-rc7
Mesa 18.1.5
Xorg 1.19.6
uCode for Vega M from Linux Firmware git (master) which includes the latest 18.20 uCode from AMD that was recently merged into Linux Firmware

I do have the two sinks available (one for the Intel iGPU and one for the AMD Vega M), running:

xrandr --listproviders

Lists the following:

Providers: number : 2
Provider 0: id: 0x6f cap: 0x9, Source Output, Sink Offload crtcs: 3 outputs: 7 associated providers: 1 name:modesetting
Provider 1: id: 0x45 cap: 0x6, Sink Output, Source Offload crtcs: 6 outputs: 0 associated providers: 1 name:Unknown AMD Radeon GPU @ pci:0000:01:00.0

And running:

env DRI_PRIME=1 glxinfo | grep "OpenGL renderer"

Lists:

OpenGL renderer string: AMD VEGAM (DRM 3.26.0, 4.18.0-041800rc7-generic, LLVM 6.0.0)

So the Vega M is active and available in my system.

I noticed that this problem started happening after the release of kernel 4.18-rc6 and continues with 4.18-rc7. I've been using 4.18 since rc1 without issue.  This entry in the changelog caught my eye:

Leo Liu (1):
      drm/amdgpu: Make sure IB tests flushed after IP resume

Not sure if this is at all related, but the reason I bring this up is that the errors I see in my logs every time I encounter this problem are:

kernel: amdgpu 0000:01:00.0: GPU pci config reset
kernel: [drm:amdgpu_device_ip_suspend [amdgpu]] *ERROR* suspend of IP block <uvd_v6_0> failed -12
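
For reference, the -12 at the end of the second line is a negated Linux errno value, and errno 12 is ENOMEM (out of memory). Below is a minimal Python sketch for pulling the failing IP block and the error name out of such a line. The function name and the regular expression are illustrative only and are not part of any amdgpu tooling.

```python
import errno
import os
import re

# Matches the amdgpu suspend failure message quoted above.
SUSPEND_FAIL = re.compile(
    r"suspend of IP block <(?P<block>[^>]+)> failed (?P<code>-?\d+)"
)


def decode_suspend_failure(line):
    """Return (ip_block, errno_name, description), or None if the line does not match."""
    match = SUSPEND_FAIL.search(line)
    if match is None:
        return None
    # The kernel reports errors as negative errno values, so drop the sign.
    code = abs(int(match.group("code")))
    name = errno.errorcode.get(code, "UNKNOWN")
    return match.group("block"), name, os.strerror(code)
```

Feeding it the second log line above should identify uvd_v6_0 as the block and ENOMEM as the error, which points at a failed memory allocation during suspend rather than a hardware fault.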

Please note that so far I have only encountered this problem when launching applications that use my Intel iGPU (i.e. I am not setting DRI_PRIME=1).

I've attached my entire log to provide more context.

Thanks!
Comment 1 Robert Strube 2018-07-31 01:31:32 UTC
Created attachment 140903
dmesg log leading up to system crash (more detailed)
Comment 2 Michel Dänzer 2018-07-31 09:09:51 UTC
Created attachment 140905
Use kvmalloc in amdgpu_uvd_suspend

Does this patch help by any chance?

If not, can you bisect between 4.18-rc1 and -rc6? Note that from your description, you'll need to test for at least one day before declaring a commit good (if you hit a failure, you can immediately declare that commit bad).
Comment 3 Robert Strube 2018-08-02 05:47:50 UTC
(In reply to Michel Dänzer from comment #2)
> Created attachment 140905
> Use kvmalloc in amdgpu_uvd_suspend
> 
> Does this patch help by any chance?
> 
> If not, can you bisect between 4.18-rc1 and -rc6? Note that from your
> description, you'll need to test for at least one day before declaring a
> commit good (if you hit a failure, you can immediately declare that commit
> bad).

Hi Michel,

Thank you for the patch.  I've rebuilt the kernel with the changes in your patch and will test it over the next several days.

I've noticed that the problem seems to occur when there is a large amount of memory pressure (e.g. I'm running a VM to which I've allocated lots of memory), and almost always right after I've opened a new application window, such as a web browser or text editor.

Today I had a scenario (running the vanilla 4.18-rc7) where I simply ran out of memory *BUT* this occurred in the absence of opening up a new application window, and the system was able to recover gracefully.

I do have 16GB of RAM in my system, but I can easily hit the limit by running a VM and opening several applications.

Should I conduct tests with memory pressure applied to see if your patch addresses the issue?  Are we trying to simulate the same scenario as before?

I'll report back my results.

P.S. I've attached another dmesg.log from the out of memory problems I ran into today (again running on vanilla 4.18-rc7 and not using your patch) so you can compare the two scenarios.  This scenario did not result in a complete system lockup, so something different must have occurred.

Thanks!
Comment 4 Robert Strube 2018-08-02 05:59:53 UTC
Created attachment 140932
dmesg log leading up to out of memory scenario (no crash this time)

In this scenario the memory pressure the system was experiencing did not lead to a system crash.  The main difference here was that I did not open a new application; the applications I already had open simply exhausted the memory I had available.
Comment 5 Robert Strube 2018-08-02 17:21:53 UTC
So I've been conducting lots of additional investigation with both the vanilla kernel (4.18-rc7) and the kernel with your patch.

I took more time to try to recreate the scenarios that cause the crash (monitoring system resources, etc.) and this is when I realized that my swapfile was very small (only 2GB).

Short story - Upon further investigation I don't believe this is a bug in DRM/amdgpu; rather, the crash happened because I simply ran out of memory *and* swap space combined.

I feel a little silly about this.  I'm running Ubuntu 18.04, and I guess the default swapfile size is 2GB.  I'm used to using swap partitions that are the same size as the system RAM, so I never considered that I could be running out of *both*.

I think at this point it's safe to close the bug.  I'm going to increase my swapfile size to 16GB and monitor my system more closely.  If I get the hard system crash I'll first determine if I ran out of swap, and then if it appears I had enough swap, I'll reopen this bug.
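
One way to check whether RAM and swap are both nearly exhausted is to read the MemAvailable and SwapFree fields from /proc/meminfo. The following is a minimal Python sketch of that check; the function names are hypothetical, and it assumes a kernel recent enough to report MemAvailable.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value.

    Most fields are reported in kB; a few (such as HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def headroom_kb(fields):
    """Memory the system can still hand out before running dry: available RAM plus free swap."""
    return fields.get("MemAvailable", 0) + fields.get("SwapFree", 0)
```

To use it on a live system, read the file with open("/proc/meminfo").read(), pass the text to parse_meminfo, and log headroom_kb periodically. A value approaching zero shortly before a lockup would support the out-of-memory explanation.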

Thanks for your assistance!
Comment 6 Michel Dänzer 2018-08-03 07:32:07 UTC
Well, https://bugs.freedesktop.org/attachment.cgi?id=140903 definitely shows an amdgpu issue exacerbating the memory pressure situation — it tries to allocate 4M of physically contiguous memory. My patch fixes that. Can you confirm that the patch at least doesn't cause any additional issues of its own?
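
To put the 4M figure in context: assuming 4 KiB pages and reading "4M" as 4 MiB, the request is 1024 contiguous pages, which the page allocator treats as an order-10 block. On x86-64 kernels of this era that is, as far as I know, the largest order the buddy allocator tracks, so such a request is likely to fail once memory is fragmented. The Python sketch below computes the order and counts free blocks of at least that order from /proc/buddyinfo-style text. The function names are illustrative.

```python
def allocation_order(size_bytes, page_size=4096):
    """Smallest order n such that 2**n pages cover size_bytes."""
    pages = -(-size_bytes // page_size)  # ceiling division
    return max(pages - 1, 0).bit_length()


def free_blocks_at_or_above(buddyinfo_text, order):
    """Sum free blocks of the given order or larger across all zones."""
    total = 0
    for line in buddyinfo_text.splitlines():
        parts = line.split()
        # Expected layout: "Node", "0,", "zone", "<name>", then one count per order.
        if len(parts) < 5 or parts[0] != "Node" or parts[2] != "zone":
            continue
        counts = [int(p) for p in parts[4:]]
        total += sum(counts[order:])
    return total
```

If free_blocks_at_or_above(text, allocation_order(4 * 1024 * 1024)) returns 0 while plenty of smaller blocks remain, a physically contiguous 4 MiB request cannot be satisfied without compaction or reclaim, even though the system is not strictly out of memory.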
Comment 7 Robert Strube 2018-08-03 17:32:44 UTC
(In reply to Michel Dänzer from comment #6)
> Well, https://bugs.freedesktop.org/attachment.cgi?id=140903 definitely shows
> an amdgpu issue exacerbating the memory pressure situation — it tries to
> allocate 4M of physically contiguous memory. My patch fixes that. Can you
> confirm that the patch at least doesn't cause any additional issues of its
> own?

Hey! Good point.

I ran the custom kernel for a couple days without issue.  Would you like me to do some more testing? I went back to vanilla 4.18-rc7 - but I'd be happy to make my daily driver 4.18-rc7 with the patch.

My understanding is that kvmalloc is a slightly safer way of allocating memory than kmalloc, in that it doesn't necessarily need the memory to be physically contiguous.  The downside is that it's not quite as performant.  Is this correct?
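
Roughly, yes: kvmalloc first attempts a physically contiguous allocation as kmalloc would, and falls back to vmalloc, which maps scattered pages into a contiguous virtual range, if that attempt fails. The toy Python model below illustrates only that fallback idea on a list of free page numbers. It is not kernel code, and every name in it is hypothetical.

```python
def find_contiguous(free_pages, count):
    """Return the first run of `count` consecutive free page numbers, or None."""
    if count <= 0:
        return []
    run = []
    for page in sorted(set(free_pages)):
        if run and page == run[-1] + 1:
            run.append(page)
        else:
            run = [page]
        if len(run) == count:
            return run
    return None


def toy_kvmalloc(free_pages, count):
    """Try a contiguous run first; otherwise fall back to any `count` free pages."""
    run = find_contiguous(free_pages, count)
    if run is not None:
        return "contiguous", run
    free = sorted(set(free_pages))
    if len(free) >= count:
        return "scattered", free[:count]
    return "failed", []
```

With free pages [0, 1, 3, 4, 6, 7], a request for three pages finds no contiguous run but still succeeds through the scattered fallback. That mirrors why the patched suspend path can succeed under fragmentation where a contiguous-only request would return -ENOMEM.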
Comment 8 Robert Strube 2018-08-06 18:33:58 UTC
(In reply to Michel Dänzer from comment #6)
> Well, https://bugs.freedesktop.org/attachment.cgi?id=140903 definitely shows
> an amdgpu issue exacerbating the memory pressure situation — it tries to
> allocate 4M of physically contiguous memory. My patch fixes that. Can you
> confirm that the patch at least doesn't cause any additional issues of its
> own?

I've now moved to 4.18-rc8.  Would you like me to apply your patch to this release and report back?

Thanks!
Rob
Comment 9 Michel Dänzer 2018-08-14 10:52:44 UTC
Please test https://patchwork.freedesktop.org/patch/242563/ instead.
Comment 10 Robert Strube 2018-10-05 20:59:23 UTC
Hello Michel,

Apologies, I've been pretty busy with work the last month or so.  I'm now available again to test out your patch (not sure if this has already made its way into mainline?).

I'm currently running 4.18.7.  Please let me know if I can start to help out on this issue again.

Rob
Comment 11 Michel Dänzer 2018-10-09 10:01:56 UTC
The patch landed in 4.19-rc1.
Comment 12 Robert Strube 2018-10-23 23:24:26 UTC
Thanks Michel!

I'm currently running 4.19.  I'll put my system under memory pressure and see if things are working OK.

Rob

