Summary:       [snb] GPU HANG in gnome-shell (after swap?)
Product:       DRI
Component:     DRM/Intel
Status:        RESOLVED MOVED
Severity:      major
Priority:      medium
Version:       unspecified
Hardware:      Other
OS:            All
Whiteboard:    Triaged, ReadyForDev
i915 platform: SNB
i915 features: GPU hang
Reporter:      Chris Murphy <bugzilla>
Assignee:      Intel GFX Bugs mailing list <intel-gfx-bugs>
QA Contact:    Intel GFX Bugs mailing list <intel-gfx-bugs>
CC:            intel-gfx-bugs, lakshminarayana.vudum
Attachments:   dmesg, drm card0 error, lspci -vvnn, dmesg kernel 5.2.11, dmesg conventional swap (5.3.0rc6)
Description
Chris Murphy
2019-08-29 06:00:44 UTC
Created attachment 145193 [details]
dmesg
Created attachment 145194 [details]
drm card0 error
Created attachment 145195 [details]
lspci -vvnn

Chris Wilson (comment #4)

Allocation failure on swap-in is not uncommon when dealing with 5G+ of swap; the kernel struggles to cope, and we make more noise than most. That failure does not look to be the cause of the later hang, though it may indeed be related to memory pressure (although, being snb, it is llc and so less susceptible to most forms of corruption, you can still hypothesize data not making it to/from swap that leads to context corruption). I would say the memory layout of the batch supports the hypothesis that the context has been swapped out and back in. So I am going to err on the side of assuming this is an invalid context image due to swap.

Chris Murphy

Created attachment 145211 [details]
dmesg kernel 5.2.11

Point of comparison with a different kernel. It looks like the same thing. I guess I just don't see these messages with the non-debug kernels.
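
For context, the attachments in this report (dmesg, the card0 error state, lspci output) correspond to data that can be gathered after an i915 GPU hang roughly as below. This is a sketch rather than the reporter's exact procedure; it assumes the Intel GPU is card0 and that the running i915 kernel exposes the error state through sysfs.

# capture the kernel log around the hang
dmesg > dmesg-hang.txt

# capture the i915 error state recorded for the most recent GPU hang
# (prints a "no error state" style message if nothing has been recorded)
sudo cat /sys/class/drm/card0/error > card0-error.txt

# capture PCI device and revision details for the GPU
lspci -vvnn > lspci.txt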

Chris Murphy (comment #6)

(In reply to Chris Wilson from comment #4)
> Allocation failure on swap-in is not uncommon when dealing with 5G+ of swap,
> the kernel struggles to cope and we make more noise than most.

Interesting. This suggests an incongruence with the typical 1:1 RAM-to-swap partition sizing used by most distro installers, at least for use cases where there will be heavy pressure on RAM rather than incidental swap usage. In your view, is this a case of "doctor, it hurts when I do this" and the doctor says, "right, so don't do that", or is there room for improvement?

Note: these examples are unique in that the test system is using swap on ZRAM, so it should be significantly faster than conventional swap on a partition. Also, these examples have /dev/zram0 sized to 1.5x RAM, but it's reproducible at 1:1. In smaller swap cases I've seen these same call traces far less frequently, and oom-killer happens more frequently.

> That failure
> does not look to be the cause of the later hang, though it may indeed be
> related to memory pressure (although being snb it is llc so less susceptible
> to most forms of corruption, you can still hypothesize data not making it
> to/from swap that leads to context corruption). I would say the memory
> layout of the batch supports the hypothesis that the context has been
> swapped out and back in. So I am going to err on the side of assuming this
> is an invalid context image due to swap.

The narrow goal of this torture test is to find ways of improving system responsiveness under heavy swap use. It also acts much like an unprivileged fork bomb that can, somewhat non-deterministically I'm finding, take down the system (totally unresponsive for >30 minutes). And in doing that, I'm stumbling over other issues like this one.

For desktops, it's a problem to not have swap big enough to support hibernation.

Chris Wilson (comment #7)

(In reply to Chris Murphy from comment #6)
> (In reply to Chris Wilson from comment #4)
> > Allocation failure on swap-in is not uncommon when dealing with 5G+ of swap,
> > the kernel struggles to cope and we make more noise than most.
>
> Interesting. This suggests an incongruence between typical 1:1 RAM swap
> partition sizes by most distro installers, at least for use cases where
> there will be heavy pressure on RAM rather than incidental swap usage. In
> your view, is this a case of, "doctor, it hurts when I do this" and the
> doctor says, "right, so don't do that" or is there room for improvement?

It's definitely the kernel's problem in mishandling resources; there are plenty still available, we just aren't getting the pages when they are required, as they are required. Aside from that, we are not prioritising interactive workloads very well under these conditions. From our point of view that only increases the mempressure for graphic resources -- work builds up faster than we can process, write amplification from client to display.

> Note: these examples are unique in that the test system is using swap on
> ZRAM. So it should be significantly faster than conventional swap on a
> partition. Also, these examples have /dev/zram0 sized to 1.5X RAM, but it's
> reproducible at 1:1. In smaller swap cases, I've seen these same call traces
> far less frequently, and also oom-killer happens more frequently.
>
> > That failure
> > does not look to be the cause of the later hang, though it may indeed be
> > related to memory pressure (although being snb it is llc so less susceptible
> > to most forms of corruption, you can still hypothesize data not making it
> > to/from swap that leads to context corruption). I would say the memory
> > layout of the batch supports the hypothesis that the context has been
> > swapped out and back in. So I am going to err on the side of assuming this
> > is an invalid context image due to swap.
>
> The narrow goal of this torture test is to find ways of improving system
> responsiveness under heavy swap use. And also it acts much like an
> unprivileged fork bomb that can, somewhat non-deterministically I'm finding,
> take down the system (totally unresponsive for >30 minutes). And in doing
> that, I'm stumbling over other issues like this one.

Yup. Death-by-swap is an old problem (when the oomkiller doesn't kill you, you can die of old age waiting for a response wishing it had). Most of our effort is spent trying to minimise the system-wide impact when running at max memory (when the caches are regularly reaped); handling swap well has been an afterthought for a decade.

Chris Murphy

Created attachment 145212 [details]
dmesg conventional swap, 5.3.0rc6

This is perhaps superfluous: a test with conventional swap on a plain partition on an SSD shows the same thing happening. We can say it's not caused by swap on ZRAM.
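
The two swap configurations compared in the comments above (swap on ZRAM sized to roughly 1.5x RAM, and a plain swap file or partition on SSD) can be reproduced approximately as follows. This is only a sketch, not the reporter's exact setup: the 12G size assumes the ~8 GB of RAM visible in /proc/meminfo below, and the device names and swap-file path are illustrative.

# swap on ZRAM, sized to about 1.5x of 8 GB RAM
sudo modprobe zram
dev=$(sudo zramctl --find --size 12G)   # allocates and prints e.g. /dev/zram0
sudo mkswap "$dev"
sudo swapon --priority 100 "$dev"

# plain swap file on an SSD-backed filesystem, 1:1 with RAM, for comparison
sudo dd if=/dev/zero of=/swapfile bs=1M count=8192 status=progress
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile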

Chris Murphy

(In reply to Chris Wilson from comment #7)
> It's definitely the kernel's problem in mishandling resources, there are
> plenty still available, we just aren't getting the pages when they are
> required, as they are required.

I see this very pronounced in the conventional swap on SSD case above, where top reports ~60% wa, and while free RAM is low, there's still quite a lot of swap left. But not a lot of activity compared to the swap on ZRAM case.

Active(file): 94364 kB

$ cat /proc/meminfo
MemTotal:        8025296 kB
MemFree:          120132 kB
MemAvailable:     119600 kB
Buffers:              84 kB
Cached:           232996 kB
SwapCached:       601992 kB
Active:          6403420 kB
Inactive:         980736 kB
Active(anon):    6309056 kB
Inactive(anon):   914428 kB
Active(file):      94364 kB
Inactive(file):    66308 kB
Unevictable:       23220 kB
Mlocked:               0 kB
SwapTotal:       8214524 kB
SwapFree:        3756296 kB
Dirty:               840 kB
Writeback:             0 kB
AnonPages:       6899812 kB
Mapped:           128652 kB
Shmem:             72784 kB
KReclaimable:     116684 kB
Slab:             324752 kB
SReclaimable:     116684 kB
SUnreclaim:       208068 kB
KernelStack:       15296 kB
PageTables:        44364 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    12227172 kB
Committed_AS:   15204776 kB
VmallocTotal:   34359738367 kB
VmallocUsed:       40452 kB
VmallocChunk:          0 kB
Percpu:            20864 kB
HardwareCorrupted:     0 kB
AnonHugePages:         0 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
CmaTotal:              0 kB
CmaFree:               0 kB
HugePages_Total:       0
HugePages_Free:        0
HugePages_Rsvd:        0
HugePages_Surp:        0
Hugepagesize:       2048 kB
Hugetlb:               0 kB
DirectMap4k:      381944 kB
DirectMap2M:     7917568 kB
[chris@fmac ~]$

> Yup. Death-by-swap is an old problem (when the oomkiller doesn't kill you,
> you can die of old age waiting for a response wishing it had). Most of our
> effort is spent trying to minimise the system-wide impact when running at
> max memory (when the caches are regularly reaped); handling swap well has
> been an afterthought for a decade.

I've tried quite a lot of variations: different sized swaps, swap on ZRAM, and zswap. And mostly it seems like rearranging deck chairs. I'm not getting enough quality data to have any idea which one is even marginally better; there's always some trade-off. I guess I should focus instead on ways of containing unprivileged fork bombs - better they get mad in their own box than take down the whole system.

*** Bug 111930 has been marked as a duplicate of this bug. ***

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/385.
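
On the containment idea raised in the last comment (letting the fork-bomb-like workload "get mad in its own box"), one possible approach is a cgroup v2 scope with memory, swap, and task limits, for example via systemd-run. This is a sketch under assumptions: the limit values are illustrative, ./memory-torture-test is a hypothetical placeholder for the reporter's workload, and running it as a user scope requires the memory and pids controllers to be delegated to the user slice.

# run the stress workload in its own scope with hard resource caps,
# so memory pressure and runaway forking stay confined to that cgroup
systemd-run --user --scope \
    -p MemoryMax=4G -p MemorySwapMax=2G -p TasksMax=512 \
    ./memory-torture-test

With limits like these, reclaim and any OOM kills are largely confined to the scope, rather than dragging the whole desktop session, including gnome-shell, into the swap storm described above.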