Created attachment 111736
/sys/kernel/debug/dri/*/

On my Acer system, graphics fairly often stall, either for a short time or indefinitely. When the system does not recover by itself, usually the only solution is to kill X. Dropping the VM caches (/proc/sys/vm/drop_caches) does not help.

PCIID: 8086:3582
Kernel: 3.18
DDX: xf86-video-intel-2.99.916
Mesa: 10.2.8

Xorg /proc/$pid/stack:
[<c10b8914>] congestion_wait+0x54/0x90
[<c10aff75>] shrink_inactive_list+0x355/0x390
[<c10b088a>] shrink_zone+0x60a/0x750
[<c10b0d62>] try_to_free_pages+0x392/0x5f0
[<c10a8fb8>] __alloc_pages_nodemask+0x328/0x7d0
[<c10db6b9>] do_huge_pmd_anonymous_page+0xe9/0x320
[<c10c2f6d>] handle_mm_fault+0x2ad/0x800
[<c103088a>] __do_page_fault+0x15a/0x490
[<c1030ccb>] do_page_fault+0xb/0x10
[<c16bb6c1>] error_code+0x65/0x6c
[<c1301e06>] drm_ioctl+0x1c6/0x650
[<c10eebbb>] do_vfs_ioctl+0x34b/0x540
[<c10eedee>] SyS_ioctl+0x3e/0x80
[<c16badd2>] sysenter_after_call+0x0/0x14
[<ffffffff>] 0xffffffff
Note this is not bug 87955 as CONFIG_DEBUG_MUTEXES is unset.
Fwiw:

shrink_inactive_list():
	/*
	 * If kswapd scans pages marked marked for immediate
	 * reclaim and under writeback (nr_immediate), it implies
	 * that pages are cycling through the LRU faster than
	 * they are written so also forcibly stall.
	 */
	if (nr_immediate && current_may_throttle())
		congestion_wait(BLK_RW_ASYNC, HZ/10);

nr_immediate is set in shrink_page_list():
	if (PageWriteback(page)) {
		if (current_is_kswapd() &&
		    PageReclaim(page) &&
		    test_bit(ZONE_WRITEBACK, &zone->flags)) {
			nr_immediate++;
			goto keep_locked;
		}
	}

which is obviously not true for Xorg, since it is not kswapd.

There is one other call to congestion_wait(), at the start of shrink_inactive_list() (it might be worth using gdb to confirm which callsite is the blocker):

	while (unlikely(too_many_isolated(zone, file, sc)))
		congestion_wait(BLK_RW_ASYNC, HZ/10);

too_many_isolated() is basically NR_ISOLATED_ANON > NR_INACTIVE_ANON, and if there is actually no backing-device activity then congestion_wait() will not make any forward progress and the loop will just keep spinning.

Maybe (though it seems to contradict the intentions of all the comments):

diff --git a/mm/vmscan.c b/mm/vmscan.c
index bd9a72bc4a1b..79a4e9379381 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1488,11 +1488,14 @@ shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
 	struct zone_reclaim_stat *reclaim_stat = &lruvec->reclaim_stat;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
-		congestion_wait(BLK_RW_ASYNC, HZ/10);
+		long rem = congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 		/* We are about to die and free our memory. Return now. */
 		if (fatal_signal_pending(current))
 			return SWAP_CLUSTER_MAX;
+
+		if (rem == 0)
+			break;
 	}
 
 	lru_add_drain();
Created attachment 112072
dmesg, with drm.debug=7

Possibly of interest: I have transparent huge pages enabled (CONFIG_TRANSPARENT_HUGEPAGE=y, CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y).

/proc/meminfo (though taken later, while firefox was still stuck, even after echo 2 > /proc/sys/vm/drop_caches):
MemTotal:        2034912 kB
MemFree:         1116420 kB
MemAvailable:    1088824 kB
Buffers:             224 kB
Cached:           271964 kB
SwapCached:            0 kB
Active:           496008 kB
Inactive:         371688 kB
Active(anon):     461052 kB
Inactive(anon):   333772 kB
Active(file):      34956 kB
Inactive(file):    37916 kB
Unevictable:       14164 kB
Mlocked:           14164 kB
HighTotal:       1153928 kB
HighFree:         590988 kB
LowTotal:         880984 kB
LowFree:          525432 kB
SwapTotal:       2097148 kB
SwapFree:        2097148 kB
Dirty:                12 kB
Writeback:             0 kB
AnonPages:        609684 kB
Mapped:           132944 kB
Shmem:            194240 kB
Slab:              23208 kB
SReclaimable:      11256 kB
SUnreclaim:        11952 kB
KernelStack:        1904 kB
PageTables:         2988 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:     3114604 kB
Committed_AS:    1802940 kB
VmallocTotal:     122880 kB
VmallocUsed:       12012 kB
VmallocChunk:      93376 kB
AnonHugePages:    118784 kB
DirectMap4k:       53240 kB
DirectMap4M:      856064 kB
mm: vmscan: fix the page state calculation in too_many_isolated

> Move the zone_page_state_snapshot() fallback logic into
> too_many_isolated(), so shrink_inactive_list() doesn't incorrectly call
> congestion_wait().

Seems like there is a known bug in this area, so let's keep our fingers crossed.
We seem to have neglected the bug a bit, apologies.

Bruno, improvements that should benefit your system have since been pushed to the kernel, so please re-test with the latest kernel. Mark the bug as REOPENED if you can reproduce it (and attach a fresh GPU error dump and kernel log), or as RESOLVED/* if you cannot.
(In reply to yann from comment #5)
> We seem to have neglected the bug a bit, apologies.
>
> Bruno, since There were improvements pushed in kernel that will benefit to
> your system, so please re-test with latest kernel and mark as REOPENED if
> you can reproduce (and attach fresh gpu error dump & kernel log) and
> RESOLVED/* if you cannot reproduce.

Timeout. Assuming that this is not occurring anymore. If this issue happens again, please re-test with the latest kernel and REOPEN if you can reproduce it (and attach a fresh GPU error dump and kernel log).