Created attachment 136929 [details]
text3dsizelimit.c

Steps to reproduce:
1. The attached file is the C source code.
2. Build it (I built it on Ubuntu 17.10, Intel® HD Graphics (Coffeelake 3x8 GT2)):
   "gcc -o tex3dsizelimit tex3dsizelimit.c -lX11 -lepoxy"
3. Run "./tex3dsizelimit"; the system hangs.

This issue was originally reported by the WebGL Conformance Tests. To reproduce it in Chrome:
1. Download the latest Chrome and install it on Ubuntu 17.10.
2. Open Chrome and load
   https://www.khronos.org/registry/webgl/sdk/tests/conformance2/textures/misc/tex-3d-size-limit.html?webglVersion=2&quiet=0
3. The system also hangs.

I checked the Linux kernel log, and it reports messages like:

tex3dsizelimit: page allocation stalls for 32092ms, order:0, mode:0x14204d2(GFP_HIGHUSER|__GFP_RETRY_MAYFAIL|__GFP_RECLAIMABLE), nodemask=(null)

The root cause seems to be that a large memory allocation for the texture image cannot be satisfied. In this case, TexImage3D is called to specify the texture images from the highest mip level down to level 0:

for (int i = 0; i < maxLevels; i++) {
    int size = 1 << i;
    int level = maxLevels - i - 1;
    glTexImage3D(GL_TEXTURE_3D, level, GL_RGBA, size, 1, 1, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
}

If TexImage3D is instead called to specify the texture images from level 0 up to the highest level, the system does not hang. Is there a different memory allocation mechanism between these two cases? Does every TexImage3D call reallocate a new, entire set of texture images? Or am I using TexImage3D in a way that does not follow the spec?
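For reference, maxLevels in the snippet above is presumably derived from the implementation's 3D texture size limit; a minimal sketch of that calculation (the attached tex3dsizelimit.c and the WebGL test are authoritative, so treat the helper below as an assumption, not the actual test code):

#include <epoxy/gl.h>

/* Hedged sketch (assumption, not the attached source): derive the number of
 * mip levels for a full chain from the implementation's 3D texture limit. */
static int query_max_3d_levels(void)
{
    GLint maxSize = 0;
    glGetIntegerv(GL_MAX_3D_TEXTURE_SIZE, &maxSize);   /* 2048 on this hardware */

    int maxLevels = 1;                                 /* level 0 always exists */
    while ((1 << (maxLevels - 1)) < maxSize)
        maxLevels++;                                   /* e.g. 2048 -> 12 levels (0..11) */
    return maxLevels;
}

With maxLevels = 12, the loop above starts at a 1x1x1 image at level 11 and ends with a 2048x1x1 image at level 0.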
I am unable to reproduce this issue on:

Intel(R) Core(TM) i5-6440HQ CPU @ 2.60GHz
Intel(R) HD Graphics 530 (Skylake GT2) (0x191b)
Ubuntu 16.04 LTS (Kernel 4.4.0-109-generic)
X.Org X Server 1.18.4
Mesa git master (ec4bb693a017)
Mesa 17.3.3 (bc1503b13fcf)

The attached program executes without issues:

$ ./tex3dsizelimit
visual 0xef selected
$ echo $?
0

On the WebGL test page I'm also seeing all tests as PASS, no failures.
It is a regression on Ubuntu 17.10; could you test it on Ubuntu 17.10? Thank you.
xingua: Please verify that this occurs with the latest Mesa on Ubuntu 17.10 by installing the Padoka PPA.
Able to reproduce this on Ubuntu 17.10. As a result my laptop completely hangs, so I'm not able to collect any logs at all.

My setup info:
OS: Ubuntu 17.10 64-bit
CPU: Intel® Core™ i7-7500U CPU @ 2.70GHz × 4
GPU: Intel® HD Graphics 620 (Kaby Lake GT2)
mesa: OpenGL ES 3.2 Mesa 17.2.4
kernel: 4.13.0-31-generic
Upgraded Ubuntu 16.04 LTS to:
X.Org X Server 1.19.5
Release Date: 2017-10-12
X Protocol Version 11, Revision 0
Kernel 4.13.0-31-generic

And I am able to reproduce this issue. The system hangs on both the program and the WebGL test page.

Bisected to:

eb1497e968bd4a0edc1606e8a6f708fab3248828 is the first bad commit
commit eb1497e968bd4a0edc1606e8a6f708fab3248828
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Fri Jul 21 12:29:30 2017 -0700

    i965/bufmgr: Allocate BO pages outside of the kernel's locking.

    Suggested by Chris Wilson.

    v2: Set the write domain to 0 (suggested by Chris).

    Reviewed-by: Matt Turner <mattst88@gmail.com>

Reverting this commit (git revert eb1497e968bd) from the latest Mesa master 18.1.0-devel (57b0ccd178bc) resolves the hang. It looks like there is a problem in the kernel driver's handling of that ioctl.

The last working kernel on Ubuntu 16.04 LTS was 4.13.0-26-generic; it hangs with 4.13.0-31-generic.
I think we need to see if this bug reproduces on the latest upstream kernel. The Mesa patch could have a dependency on a kernel patch that is missing from the Ubuntu kernel, or this may be a bug in the Ubuntu kernel.
Tested upstream kernels (as described in https://wiki.ubuntu.com/KernelTeam/GitKernelBuild):
- 4.15.0-rc9 (git latest 993ca2068b04)
- 4.14.0 (git v4.14 bebc6082da0a)

And the hang is also present. So it is not Ubuntu-specific.
Ken, any thoughts on whether some particular kernel change is missing?
This is pretty surprising...it's a whole system hang? Or a GPU hang? I suppose we may be allocating pages earlier, so maybe we're running out of memory for something critical later, rather than running out of memory for the big 3D texture... But, it still seems pretty fishy. Doesn't seem like this should cause lockups. Chris, do you have any ideas?
This is a whole system hang: nothing works, and only a hard reboot helps. So I can't get any additional info. Tried the latest git Mesa master (ef272b161e05) and the drm-tip kernel (4.15.0, a2fbc8000254) with the same result - hang.
Just for information, it still hangs on latest Debian testing:
- Mesa 18.1.0-devel (git-ab94875352)
- Kernel 4.15.0-1-amd64 #1 SMP Debian 4.15.4-1 (2018-02-18)
Andriy: From your comments, we have the following failure pattern:

linux 4.4 / Mesa 17.3.3: PASS
linux 4.13 / Mesa 17.2.4: FAIL

Please bisect between 4.13 and 4.4 to determine which kernel commit is breaking Ken's BO allocations.
Reverting eb1497e968bd4a0edc1606e8a6f708fab3248828 on master prevents the system hang reported in this bug. We still require a kernel bisection, so we can figure out if this patch is wrong, or if there is an issue with the kernel.
Forgot about this oom-killer scenario. From the description and looking at the test, it just looks like an oom-killer rampage. It completely exhausts all my memory, often leaving the oom-killer with little choice but to panic the system.
I just ran this locally, and indeed got piles of oom-killer. My system survived - a couple of programs got killed - and Mesa eventually returned GL_OUT_OF_MEMORY.

I'm not sure what we can do about this, to be honest. It sounds like the 'system hang' is that the Linux OOM killer torches something critical...which would be a general Linux problem with being out of memory.

Prior to the bisected patch, Mesa would allocate pages for the texture on first access. Now, it allocates them on creation. This program happens to allocate a texture and never use it. But if you ever did use it, you'd suffer the same fate. I find it highly unlikely that any real-world program would hit this case - if someone allocates a texture, they probably intend to use it.

Chris, is there some reason that the kernel can't just...swap those pages out? Nothing is using them. Perhaps we should madvise them until first use or something? Or should we avoid allocating huge things (above some threshold) up front? Or...really...the OOM killer sabotaging systems seems like a core Linux problem, and not anything we can do much about...
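As a side note, the "madvise them until first use" idea maps onto the existing DRM_IOCTL_I915_GEM_MADVISE interface. A minimal sketch of marking an idle BO purgeable (illustrative only; the helper name and any caller-side policy are assumptions, not Mesa's bufmgr code):

#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>
#include <drm/i915_drm.h>

/* Illustrative sketch: mark a not-yet-used BO as purgeable so the kernel may
 * drop its pages under memory pressure, and flip it back to WILLNEED before
 * the first real access. Uses the existing I915_GEM_MADVISE ioctl; this is
 * not a proposed Mesa patch. */
static int bo_madvise(int drm_fd, uint32_t gem_handle, uint32_t advice)
{
    struct drm_i915_gem_madvise madv = {
        .handle = gem_handle,
        .madv = advice,        /* I915_MADV_WILLNEED or I915_MADV_DONTNEED */
    };

    if (ioctl(drm_fd, DRM_IOCTL_I915_GEM_MADVISE, &madv) != 0)
        return -errno;

    /* retained == 0 means the pages were already purged and the contents
     * are gone; the caller would have to re-populate the BO. */
    return madv.retained;
}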
I've tried to bisect the upstream kernel, but tried to shorten the range by testing major tags. Tested down to v4.11 and the system still hangs (v4.14-rc4, v4.13, v4.13-rc6, v4.13-rc2, v4.12, v4.11). Mesa 17.3.3 from Debian buster. I've probably missed that some other libraries were also updated (in comment 5).
The kernel was bisected to commit 40e62d5d6be8b4999068da31ee6aca7ca76669ee:
drm/i915: Acquire the backing storage outside of struct_mutex in set-domain
https://patchwork.freedesktop.org/patch/119012/

It seems that before that patch memory wasn't immediately allocated, but I could be wrong. It also seems that the OOM killer doesn't know about such allocations and doesn't kill the example application until the very end.

However, why are we even able to request such a big allocation when creating a texture? There is Const.MaxTextureMbytes, and checking against it should prevent the creation of such a texture. i965 doesn't provide a custom TestProxyTexImage and uses _mesa_test_proxy_teximage, which doesn't take 'level' into account, so a texture with dimensions of 1x1x1 and level=11 easily passes the check. Later, in intel_miptree_create_for_teximage, the dimensions of the image at level 0 are determined to be 2048x2048x2048, but at that point there are no checks on the resulting image size.

A solution may be a custom TestProxyTexImage where the size of the image at level 0 is checked, so the texture size always obeys the limits.

I also found that radeon_miptree_create_for_teximage has special checks for height and depth being 1; in that case they stay 1 at all levels. Just an observation...

To sum up:
- One issue is that the texture size limit is enforced inconsistently, and that can be fixed in Mesa.
- The second is the OOM killer being unable to cope with this type of allocation. I don't have any knowledge about that one.
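A rough standalone sketch of that level-0 size check (the function name, parameters, and the naive per-axis scaling are illustrative assumptions, not Mesa's actual TestProxyTexImage hook signature):

#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch: given the dimensions passed for some mip level,
 * compute what the level-0 image (and full mip chain) would be, and reject
 * it if it exceeds a budget such as Const.MaxTextureMbytes. For a 3D texture
 * every axis doubles per level towards the base, which is how 1x1x1 at
 * level 11 implies a 2048x2048x2048 level 0. */
static bool
miptree_within_budget(int level, int width, int height, int depth,
                      int bytes_per_texel, uint64_t max_mbytes)
{
    uint64_t w = (uint64_t)width  << level;   /* implied level-0 width  */
    uint64_t h = (uint64_t)height << level;   /* implied level-0 height */
    uint64_t d = (uint64_t)depth  << level;   /* implied level-0 depth  */

    uint64_t total = 0;
    for (;;) {
        total += w * h * d * (uint64_t)bytes_per_texel;
        if (w == 1 && h == 1 && d == 1)
            break;
        if (w > 1) w /= 2;
        if (h > 1) h /= 2;
        if (d > 1) d /= 2;
    }
    return total <= max_mbytes * 1024ull * 1024ull;
}

With the values from this bug (1x1x1 at level 11, RGBA8), the implied miptree is far beyond any reasonable MaxTextureMbytes limit, so a check like this would reject the glTexImage3D call before any BO is created.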
(In reply to Danylo from comment #17)
> A solution may be a custom TestProxyTexImage where the size of the image at
> level 0 is checked, so the texture size always obeys the limits.

Any thoughts on whether this is a good solution?
Since this has been bisected to a kernel commit, should it be assigned to Mesa?
While there is a bisected commit, it's possibly not a kernel issue - at least half of it is not in the kernel.

> However, why are we even able to request such a big allocation when creating
> a texture? There is Const.MaxTextureMbytes, and checking against it should
> prevent the creation of such a texture.

The main issue, as I described in my previous comment, is that Mesa tries to allocate more memory than it itself allows. On the kernel side the only issue is that the application is killed last, even though it holds the most memory.
Hi all,

I think this issue is very serious: the system hangs and a hard shutdown is needed to recover it. Could you investigate it again? Thank you.

As described in this bug report, calling TexImage3D to specify the texture images from the highest mip level down to level 0 hangs the system:

for (int i = 0; i < maxLevels; i++) {
    int size = 1 << i;
    int level = maxLevels - i - 1;
    glTexImage3D(GL_TEXTURE_3D, level, GL_RGBA, size, 1, 1, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
}

But calling TexImage3D to specify the texture images from level 0 up to the highest level does not hang the system:

for (int i = 0; i < maxLevels; i++) {
    int size = 1 << (maxLevels - i - 1);
    int level = i;
    glTexImage3D(GL_TEXTURE_3D, level, GL_RGBA, size, 1, 1, 0,
                 GL_RGBA, GL_UNSIGNED_BYTE, NULL);
}

I do not know why these two ways of allocating memory behave so differently.
Reassigning to i915: on my laptop (with 16 GB of RAM) it tries to allocate a BO of ~12.25 GB and completely locks up the system.
But according to Danylo, there are two issues here: one on the Mesa side and one on the kernel side. So possibly we need one ticket for i965 and one (the current one) for drm/intel? Just in case the issue is only fixed in one place...
Created attachment 142833 [details] [review]
i965: limit texture total size

A workaround for i965, but we should really not have the i915 allocation lock up the system.
Is the correct resolution to track graphics memory in a way that allows the OOM killer to target the process that is locking up the system? Lionel's workaround will handle this particular case, but there are many other ways to produce the same effect.
(In reply to Mark Janes from comment #26)
> Is the correct resolution to track graphics memory in a way that allows the
> OOM killer to target the process that is locking up the system?
>
> Lionel's workaround will handle this particular case, but there are many
> other ways to produce the same effect.

(In reply to Denis from comment #24)
> But according to Danylo, there are two issues here: one on the Mesa side and
> one on the kernel side. So possibly we need one ticket for i965 and one (the
> current one) for drm/intel? Just in case the issue is only fixed in one
> place...

I don't think a simple ioctl to the i915 driver with a particular size should lock up the system. Any userspace program can do that; this isn't related to Mesa.
(In reply to Mark Janes from comment #26)
> Is the correct resolution to track graphics memory in a way that allows the
> OOM killer to target the process that is locking up the system?

GFX memory usage tracking isn't just an Intel or 3D specific issue, it's an issue for the whole kernel DRI subsystem. There's even a CVE about it:
https://nvd.nist.gov/vuln/detail/CVE-2013-7445

> Lionel's workaround will handle this particular case, but there are many
> other ways to produce the same effect.

See also bug 106106 and bug 106136.
Has the crash issue been resolved recently? I could not reproduce this issue on Ubuntu Disco Dingo. On my machine it either ran correctly or reported an out-of-memory message. I think it is normal to report out-of-memory when the system is in a low-memory situation or cannot allocate ~12 GB of memory. Do you agree?
Hi. I also re-checked this issue on Manjaro (kernel 5.0.5) and can say that it doesn't lead to a hang anymore.

The test mostly passes; only 3 points fail, with an OOM error.

The compiled test also no longer freezes the system. It allocates all my memory (15.9 GB) and then terminates normally.

@Lionel - is this expected, or do we need to bisect to find what might have fixed this behaviour? What do you think?
(In reply to Denis from comment #31)
> Hi. I also re-checked this issue on Manjaro (kernel 5.0.5) and can say that
> it doesn't lead to a hang anymore.
>
> The test mostly passes; only 3 points fail, with an OOM error.
>
> The compiled test also no longer freezes the system. It allocates all my
> memory (15.9 GB) and then terminates normally.
>
> @Lionel - is this expected, or do we need to bisect to find what might have
> fixed this behaviour? What do you think?

@Lionel, can you answer the questions above?
(In reply to Lakshmi from comment #32)
> (In reply to Denis from comment #31)
> > @Lionel - is this expected, or do we need to bisect to find what might
> > have fixed this behaviour? What do you think?
>
> @Lionel, can you answer the questions above?

I don't think it's bisectable (or at least not easily), because this has been the behavior for quite some time. I haven't been around i915 for that long, though :)
That's true 8-/ I tried to bisect on the drm-tip and drm-intel-testing repos and ended up with errors during compiling 8-/ I suggest closing this issue as fixed somewhere between kernels 4.19 and 5.0.21.
Closing this issue as Fixed as per the suggestion.