Bug 107652 - amdgpu couldn't resume after suspend
Summary: amdgpu couldn't resume after suspend
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-21 21:05 UTC by mikhail.v.gavrilov
Modified: 2019-05-20 20:25 UTC
CC List: 4 users

See Also:
i915 platform:
i915 features:


Attachments
dmesg (208.37 KB, text/plain)
2018-08-21 21:05 UTC, mikhail.v.gavrilov
system log (39.25 MB, text/plain)
2018-08-21 21:05 UTC, mikhail.v.gavrilov
system log (4.19.0-0.rc1.git0.1) (6.75 MB, text/plain)
2018-08-28 20:29 UTC, mikhail.v.gavrilov
memory status before (2.35 KB, text/plain)
2018-08-29 04:27 UTC, mikhail.v.gavrilov
memory status after (2.35 KB, text/plain)
2018-08-29 04:27 UTC, mikhail.v.gavrilov
system log (12.56 MB, text/plain)
2018-08-29 04:29 UTC, mikhail.v.gavrilov
dmesg (4.19.0-0.rc1.git0.1) (8.06 MB, text/plain)
2018-08-29 15:57 UTC, mikhail.v.gavrilov
amdgpu_gem_info before (952.38 KB, text/plain)
2018-08-31 03:51 UTC, mikhail.v.gavrilov
amdgpu_gem_info after (952.50 KB, text/plain)
2018-08-31 03:52 UTC, mikhail.v.gavrilov
0001-drm-amdgpu-Allocate-UVD-FW-BO-backup-RAM-space-on-in.patch (3.35 KB, patch)
2018-09-20 19:21 UTC, Andrey Grodzovsky
dmesg after patch 0001 (2.43 MB, text/plain)
2018-09-21 03:21 UTC, mikhail.v.gavrilov
kernel log pre and post suspend (177.42 KB, text/plain)
2019-01-09 17:47 UTC, Mart Raudsepp

Description mikhail.v.gavrilov 2018-08-21 21:05:06 UTC
Created attachment 141225 [details]
dmesg

Steps to reproduce:
1) Put the computer into suspend mode.
2) Wake up the computer with the Power button.

$ inxi -bM
System:    Host: localhost.localdomain Kernel: 4.19.0-0.rc0.git3.1.fc30.x86_64 x86_64 bits: 64 
           Desktop: Gnome 3.29.90 Distro: Fedora release 29 (Rawhide) 
Machine:   Type: Desktop Mobo: ASUSTeK model: ROG STRIX X470-I GAMING v: Rev 1.xx serial: <root required> 
           UEFI: American Megatrends v: 0901 date: 07/23/2018 
CPU:       8-Core: AMD Ryzen 7 2700X type: MT MCP speed: 3381 MHz min/max: 2200/3700 MHz 
Graphics:  Card-1: Advanced Micro Devices [AMD/ATI] Vega 10 XT [Radeon RX Vega 64] driver: amdgpu v: kernel 
           Display: wayland server: Fedora Project X.org 11.0 driver: amdgpu resolution: 3840x2160~60Hz 
           OpenGL: renderer: Radeon RX Vega (VEGA10 DRM 3.27.0 4.19.0-0.rc0.git3.1.fc30.x86_64 LLVM 6.0.1) 
           v: 4.5 Mesa 18.1.5 
Network:   Card-1: Intel I211 Gigabit Network driver: igb 
           Card-2: Realtek RTL8822BE 802.11a/b/g/n/ac WiFi adapter driver: r8822be 
Drives:    Local Storage: total: 11.35 TiB used: 4.39 TiB (38.6%) 
Info:      Processes: 575 Uptime: 2h 00m Memory: 31.36 GiB used: 28.11 GiB (89.6%) Shell: bash inxi: 3.0.20
Comment 1 mikhail.v.gavrilov 2018-08-21 21:05:54 UTC
Created attachment 141226 [details]
system log
Comment 2 Andrey Grodzovsky 2018-08-28 16:54:00 UTC
From the log, it seems your system was out of memory at the time suspend was called. I see a few user-mode apps such as Steam crashing before that, which could be related.
That in turn caused GPU buffer eviction to fail during suspend, and hence the failures after resume.

Please check your memory status before suspending, and try to figure out when the memory exhaustion starts and under what use case.

Use the commands from here to check memory status: https://www.binarytides.com/linux-command-check-memory-usage/
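
As a rough sketch of the kind of check requested here, the Python snippet below parses /proc/meminfo and pulls out the figures most relevant before a suspend: available RAM and free/used swap. The helper names parse_meminfo, suspend_memory_summary and read_meminfo are hypothetical and not part of any tool mentioned in this report; the field names (MemAvailable, SwapTotal, SwapFree) are the standard /proc/meminfo ones.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers.

    Most values are in kB; a few counters (e.g. HugePages_Total) are unitless.
    """
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])
    return values


def suspend_memory_summary(info):
    """Return the figures of interest before a suspend, converted to MiB."""
    swap_total = info.get("SwapTotal", 0)
    swap_free = info.get("SwapFree", 0)
    return {
        "mem_available_mib": info.get("MemAvailable", 0) // 1024,
        "swap_free_mib": swap_free // 1024,
        "swap_used_mib": (swap_total - swap_free) // 1024,
    }


def read_meminfo(path="/proc/meminfo"):
    """Read and parse the live file (Linux only)."""
    with open(path) as fh:
        return parse_meminfo(fh.read())
```

Calling suspend_memory_summary(read_meminfo()) right before running systemctl suspend would give a small record that can be compared across attempts.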
Comment 3 mikhail.v.gavrilov 2018-08-28 20:28:17 UTC
Yes, you're right.
But suspend mode would be totally useless on a computer on which no programs are running.
The point of suspend mode is to put the computer to sleep with all programs running, then wake it up and have everything continue to work.

Anyway, I see that swap had enough free space to hold the full contents of RAM.

$ free -m
              total        used        free      shared  buff/cache   available
Mem:          32158       27500        1054        1193        3603        3007
Swap:         65535        7912       57623


$ cat  /proc/meminfo
MemTotal:       32930572 kB
MemFree:         1149372 kB
MemAvailable:    3127012 kB
Buffers:              28 kB
Cached:          3366532 kB
SwapCached:      1007320 kB
Active:         20999764 kB
Inactive:        3531864 kB
Active(anon):   19666712 kB
Inactive(anon):  2725324 kB
Active(file):    1333052 kB
Inactive(file):   806540 kB
Unevictable:       31468 kB
Mlocked:           31468 kB
SwapTotal:      67108860 kB
SwapFree:       59004668 kB
Dirty:              2008 kB
Writeback:             0 kB
AnonPages:      21151436 kB
Mapped:          1888740 kB
Shmem:           1222624 kB
Slab:             894752 kB
SReclaimable:     301996 kB
SUnreclaim:       592756 kB
KernelStack:       77072 kB
PageTables:       405340 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    83574144 kB
Committed_AS:   347269980 kB
VmallocTotal:   34359738367 kB
VmallocUsed:           0 kB
VmallocChunk:          0 kB
Percpu:            12864 kB
HardwareCorrupted:     0 kB
AnonHugePages:   2207744 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:    29582752 kB
DirectMap2M:     3901440 kB
DirectMap1G:     1048576 kB

$ vmstat -s
     32930572 K total memory
     28110600 K used memory
     21000912 K active memory
      3542156 K inactive memory
      1140784 K free memory
           28 K buffer memory
      3679160 K swap cache
     67108860 K total swap
      8103680 K used swap
     59005180 K free swap
     21926506 non-nice user cpu ticks
      1867047 nice user cpu ticks
      4336923 system cpu ticks
    101407781 idle cpu ticks
       547470 IO-wait cpu ticks
       452621 IRQ cpu ticks
       266687 softirq cpu ticks
            0 stolen cpu ticks
     40223592 pages paged in
     62917184 pages paged out
      2325269 pages swapped in
      4803989 pages swapped out
   2369356089 interrupts
   4293312571 CPU context switches
   1535402349 boot time
       398972 forks
Comment 4 mikhail.v.gavrilov 2018-08-28 20:29:11 UTC
Created attachment 141325 [details]
system log (4.19.0-0.rc1.git0.1)
Comment 5 Andrey Grodzovsky 2018-08-28 22:18:58 UTC
(In reply to mikhail.v.gavrilov from comment #4)
> Created attachment 141325 [details]
> system log (4.19.0-0.rc1.git0.1)

Can you now show the memory status after the suspend happened and failed?
Can you also repeat the test with minimal graphics enabled (switch to the FB console, sudo xinit) and then repeat the steps to see if this still happens?
Comment 6 mikhail.v.gavrilov 2018-08-29 04:27:27 UTC
Created attachment 141331 [details]
memory status before
Comment 7 mikhail.v.gavrilov 2018-08-29 04:27:44 UTC
Created attachment 141332 [details]
memory status after
Comment 8 mikhail.v.gavrilov 2018-08-29 04:29:33 UTC
Created attachment 141333 [details]
system log
Comment 9 mikhail.v.gavrilov 2018-08-29 04:30:24 UTC
> Can you now show memory status after suspend happened and failed ?
> Can you also try repeat the test with minimal graphics enabled(switch to FB
> console, sudo xinit) and then repeat the steps to see if this still happens



(In reply to mikhail.v.gavrilov from comment #6)
> Created attachment 141331 [details]
> memory status before

# systemctl suspend

(In reply to mikhail.v.gavrilov from comment #7)
> Created attachment 141332 [details]
> memory status after

I did this in the FB console, but the result is the same:

(In reply to mikhail.v.gavrilov from comment #8)
> Created attachment 141333 [details]
> system log
Comment 10 mikhail.v.gavrilov 2018-08-29 15:57:56 UTC
Created attachment 141348 [details]
dmesg (4.19.0-0.rc1.git0.1)
Comment 11 Andrey Grodzovsky 2018-08-30 20:01:56 UTC
I see from the log that the failure was on an order-0 allocation (1 page) in zone NORMAL, but that zone still had enough 1-page blocks, and even larger blocks, to fulfill the request, so that is strange.
The only problem I see in the logs is that free memory in zone NORMAL was lower than the min watermark, which AFAIK should have triggered kswapd to start swapping out memory. I do see that you already have some pages in swap, so maybe that is it.
Anyway, I can't tell from the logs exactly why it failed. Possibly some memory leaks.
Please attach the output of cat /sys/kernel/debug/dri/0/amdgpu_gem_info taken immediately before and after the suspend operation, to see how much memory the driver allocated.
I will try to ask people from #mm about your log.
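
One possible way to take the before/after captures is sketched below in Python. The debugfs path is the one given above; the snapshot helper, the GEM_INFO constant and the output directory name are hypothetical. Reading debugfs normally requires root and a mounted /sys/kernel/debug.

```python
import pathlib
import time

# Path from the request above; needs root and a mounted debugfs.
GEM_INFO = "/sys/kernel/debug/dri/0/amdgpu_gem_info"


def snapshot(src, out_dir, label):
    """Copy the current contents of src to out_dir/<label>-<timestamp>.txt.

    Returns the path of the file that was written.
    """
    out_dir = pathlib.Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    data = pathlib.Path(src).read_text(errors="replace")
    dest = out_dir / f"{label}-{time.strftime('%Y%m%dT%H%M%S')}.txt"
    dest.write_text(data)
    return dest


# Intended use (run as root):
#   snapshot(GEM_INFO, "gem-snapshots", "before")
#   ... systemctl suspend, then wake the machine ...
#   snapshot(GEM_INFO, "gem-snapshots", "after")
```

The two resulting files can then be attached to the report as they are, or compared with diff.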
Comment 12 mikhail.v.gavrilov 2018-08-31 03:51:57 UTC
Created attachment 141390 [details]
amdgpu_gem_info before
Comment 13 mikhail.v.gavrilov 2018-08-31 03:52:16 UTC
Created attachment 141391 [details]
amdgpu_gem_info after
Comment 14 Andrey Grodzovsky 2018-09-20 19:21:54 UTC
Created attachment 141663 [details] [review]
0001-drm-amdgpu-Allocate-UVD-FW-BO-backup-RAM-space-on-in.patch

This is just a shot in the dark, but please give it a try and see if it helps with the suspend/resume issue.
Comment 15 mikhail.v.gavrilov 2018-09-21 03:21:12 UTC
Created attachment 141666 [details]
dmesg after patch 0001
Comment 16 mikhail.v.gavrilov 2018-09-21 03:22:40 UTC
(In reply to Andrey Grodzovsky from comment #14)
> Created attachment 141663 [details] [review] [review]
> 0001-drm-amdgpu-Allocate-UVD-FW-BO-backup-RAM-space-on-in.patch
> 
> This is just a shot in the dark but please give a try - see if it helps with
> suspend resume issue.

The patch didn't help.
The new dmesg is attached here:
(In reply to mikhail.v.gavrilov from comment #15)
> Created attachment 141666 [details]
> dmesg after patch 0001
Comment 17 Mart Raudsepp 2019-01-09 17:47:27 UTC
Created attachment 143042 [details]
kernel log pre and post suspend

I still hit something like this with a 4.20 kernel. Perhaps it gives additional data points to figure it out.

