Bug 88026 - [i855GM, 3.18] X getting stuck in congestion_wait for shrinker
Summary: [i855GM, 3.18] X getting stuck in congestion_wait for shrinker
Status: CLOSED WONTFIX
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other All
Importance: lowest normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-01-04 21:09 UTC by Bruno
Modified: 2017-03-03 16:39 UTC
CC: 1 user

See Also:
i915 platform: I85X
i915 features: GEM/Other


Attachments
/sys/kernel/debug/dri/*/ (19.08 KB, text/plain)
2015-01-04 21:09 UTC, Bruno
dmesg, with drm.debug=7 (250.58 KB, text/plain)
2015-01-10 19:37 UTC, Bruno

Description Bruno 2015-01-04 21:09:53 UTC
Created attachment 111736 [details]
/sys/kernel/debug/dri/*/

On my Acer system the graphics fairly often stall, either for a short time or, occasionally, indefinitely.

When the system does not recover by itself, usually the only solution is to kill X.
Dropping VM caches (/proc/sys/vm/drop_caches) does not help.

PCIID: 8086:3582
Kernel: 3.18
DDX: xf86-video-intel-2.99.916
Mesa: 10.2.8


Xorg /proc/$pid/stack:
[<c10b8914>] congestion_wait+0x54/0x90
[<c10aff75>] shrink_inactive_list+0x355/0x390
[<c10b088a>] shrink_zone+0x60a/0x750
[<c10b0d62>] try_to_free_pages+0x392/0x5f0
[<c10a8fb8>] __alloc_pages_nodemask+0x328/0x7d0
[<c10db6b9>] do_huge_pmd_anonymous_page+0xe9/0x320
[<c10c2f6d>] handle_mm_fault+0x2ad/0x800
[<c103088a>] __do_page_fault+0x15a/0x490
[<c1030ccb>] do_page_fault+0xb/0x10
[<c16bb6c1>] error_code+0x65/0x6c
[<c1301e06>] drm_ioctl+0x1c6/0x650
[<c10eebbb>] do_vfs_ioctl+0x34b/0x540
[<c10eedee>] SyS_ioctl+0x3e/0x80
[<c16badd2>] sysenter_after_call+0x0/0x14
[<ffffffff>] 0xffffffff
Comment 1 Chris Wilson 2015-01-04 21:15:50 UTC
Note this is not bug 87955 as CONFIG_DEBUG_MUTEXES is unset.
Comment 2 Chris Wilson 2015-01-05 11:05:18 UTC
Fwiw:

shrink_inactive_list():
/*
 * If kswapd scans pages marked for immediate
 * reclaim and under writeback (nr_immediate), it implies
 * that pages are cycling through the LRU faster than
 * they are written so also forcibly stall.
 */
 if (nr_immediate && current_may_throttle())
   congestion_wait(BLK_RW_ASYNC, HZ/10);

nr_immediate is set in shrink_page_list():
if (PageWriteback(page)) {
  if (current_is_kswapd() &&
      PageReclaim(page) &&
      test_bit(ZONE_WRITEBACK, &zone->flags)) {
      nr_immediate++;
      goto keep_locked;
   }
}

which is obviously never true for Xorg, since current_is_kswapd() is false for a user task doing direct reclaim.


There is one other call to congestion_wait() at the start of shrink_inactive_list() (might be worth using gdb to confirm which callsite is the blocker):

while (unlikely(too_many_isolated(zone, file, sc)))
   congestion_wait(BLK_RW_ASYNC, HZ/10);

too_many_isolated() is basically the check NR_ISOLATED_ANON > NR_INACTIVE_ANON (plus the analogous check for file pages). If there is actually no backing-device activity, congestion_wait() will not make any forward progress and the loop just spins.

Maybe (though it seems to contradict the intentions of all the comments):

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd9a72bc4a1b..79a4e9379381 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1488,11 +1488,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
        struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
        while (unlikely(too_many_isolated(zone, file, sc))) {
-               congestion_wait(BLK_RW_ASYNC, HZ/10);
+               long rem = congestion_wait(BLK_RW_ASYNC, HZ/10);
 
                /* We are about to die and free our memory. Return now. */
                if (fatal_signal_pending(current))
                        return SWAP_CLUSTER_MAX;
+
+               if (rem == 0)
+                       break;
        }
 
        lru_add_drain();
Comment 3 Bruno 2015-01-10 19:37:08 UTC
Created attachment 112072 [details]
dmesg, with drm.debug=7

Possibly of interest:
I have transparent huge pages enabled CONFIG_TRANSPARENT_HUGEPAGE=y, CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y.

/proc/meminfo (though taken later, while firefox was still stuck, even after echo 2 > /proc/sys/vm/drop_caches)

MemTotal:        2034912 kB
MemFree:         1116420 kB
MemAvailable:    1088824 kB
Buffers:             224 kB
Cached:           271964 kB
SwapCached:            0 kB
Active:           496008 kB
Inactive:         371688 kB
Active(anon):     461052 kB
Inactive(anon):   333772 kB
Active(file):      34956 kB
Inactive(file):    37916 kB
Unevictable:       14164 kB
Mlocked:           14164 kB
HighTotal:       1153928 kB
HighFree:         590988 kB
LowTotal:         880984 kB
LowFree:          525432 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
Dirty:                12 kB
Writeback:             0 kB
AnonPages:        609684 kB
Mapped:           132944 kB
Shmem:            194240 kB
Slab:              23208 kB
SReclaimable:      11256 kB
SUnreclaim:        11952 kB
KernelStack:        1904 kB
PageTables:         2988 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     3114604 kB
Committed_AS:    1802940 kB
VmallocTotal:     122880 kB
VmallocUsed:       12012 kB
VmallocChunk:      93376 kB
AnonHugePages:    118784 kB
DirectMap4k:       53240 kB
DirectMap4M:      856064 kB
Comment 4 Chris Wilson 2015-01-17 16:40:13 UTC
mm: vmscan: fix the page state calculation in too_many_isolated

> Move the zone_page_state_snapshot() fallback logic into
> too_many_isolated(), so shrink_inactive_list() doesn't incorrectly call
> congestion_wait().

Seems like there is a known bug in this area, so let's keep our fingers crossed.
Comment 5 yann 2017-02-24 08:19:20 UTC
We seem to have neglected this bug a bit; apologies.

Bruno, improvements have since been pushed to the kernel that should benefit your system, so please re-test with the latest kernel and mark the bug REOPENED if you can reproduce (attaching a fresh gpu error dump & kernel log), or RESOLVED/* if you cannot reproduce.
Comment 6 yann 2017-03-03 16:39:28 UTC
(In reply to yann from comment #5)
> We seem to have neglected this bug a bit; apologies.
> 
> Bruno, improvements have since been pushed to the kernel that should
> benefit your system, so please re-test with the latest kernel and mark the
> bug REOPENED if you can reproduce (attaching a fresh gpu error dump &
> kernel log), or RESOLVED/* if you cannot reproduce.

Timeout; assuming this is no longer occurring. If this issue happens again, please re-test with the latest kernel and REOPEN if you can reproduce (and attach a fresh gpu error dump & kernel log).

