Bug 104760

Summary: System hang when using glTexImage3D to specify a 3D texture image
Product: DRI
Reporter: xinghua <xinghua.cao>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: chris, danylo.piliaiev, eero.t.tamminen, intel-gfx-bugs, kenneth, lionel.g.landwerlin, yang.gu
Version: DRI git
Keywords: bisected, regression, security
Hardware: x86-64 (AMD64)
OS: Linux (All)
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=106106
          https://bugs.freedesktop.org/show_bug.cgi?id=106136
Whiteboard: Triaged
i915 platform: SKL
i915 features:
Attachments:
  text3dsizelimit.c (flags: none)
  attachment-20682-0.html (flags: none)
  i965: limit texture total size (flags: none)
  attachment-10542-0.html (flags: none)

Description xinghua 2018-01-24 04:39:32 UTC
Created attachment 136929 [details]
text3dsizelimit.c

Steps to reproduce:
1. The attached file is the C source code.
2. Build the source file (I built it on Ubuntu 17.10, Intel® HD Graphics (Coffeelake 3x8 GT2)): "gcc -o tex3dsizelimit tex3dsizelimit.c -lX11 -lepoxy"
3. Run "./tex3dsizelimit"; the system hangs.

This issue was actually reported from the WebGL Conformance Tests. To reproduce it in Chrome, follow these steps:
1. Download the latest Chrome and install it on Ubuntu 17.10.
2. Open Chrome and open the link https://www.khronos.org/registry/webgl/sdk/tests/conformance2/textures/misc/tex-3d-size-limit.html?webglVersion=2&quiet=0
3. The system also hangs.

I checked the Linux kernel log, and it reported messages like the one below:
tex3dsizelimit: page allocation stalls for 32092ms, order:0, mode:0x14204d2(GFP_HIGHUSER|__GFP_RETRY_MAYFAIL|__GFP_RECLAIMABLE), nodemask=(null)

The root cause seems to be that a very large memory allocation for the texture image could not be satisfied.

In this case, glTexImage3D is called to specify the texture images from the highest mip level down to level 0:
for(int i = 0; i < maxLevels; i++)
  {
    int size = 1 << i;
    int level = maxLevels - i - 1;
    glTexImage3D(GL_TEXTURE_3D, level, GL_RGBA, size, 1, 1, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
  }
If glTexImage3D is called to specify the texture images from level 0 up to the highest level instead, the system does not hang.
Is there a different memory allocation mechanism between these two situations? Does every glTexImage3D call reallocate a new, complete set of texture images?
Or am I using glTexImage3D in a way that does not follow the spec?
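
For comparison, a minimal sketch of the order that does not hang, assuming the same maxLevels as above (this is also the loop shown later in comment 22):

for (int i = 0; i < maxLevels; i++)
{
    /* level 0 (the largest image) is specified first, then progressively smaller mips */
    int size = 1 << (maxLevels - i - 1);
    int level = i;
    glTexImage3D(GL_TEXTURE_3D, level, GL_RGBA, size, 1, 1, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
}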
Comment 1 Andriy Khulap 2018-01-24 14:09:51 UTC
I am unable to reproduce this issue on:

Intel(R) Core(TM) i5-6440HQ CPU @ 2.60GHz
Intel(R) HD Graphics 530 (Skylake GT2)  (0x191b)
Ubuntu 16.04 LTS (Kernel 4.4.0-109-generic)
X.Org X Server 1.18.4
Mesa git master (ec4bb693a017) 
Mesa 17.3.3 (bc1503b13fcf)

Attached program executes without issues.
$ ./tex3dsizelimit 

	visual 0xef selected
$ echo $?
0

On the WebGL test page I'm also seeing all tests PASS, no failures.
Comment 2 xinghua 2018-01-24 14:13:41 UTC
It is a regression on Ubuntu 17.10; could you test it on Ubuntu 17.10? Thank you.
Comment 3 Mark Janes 2018-01-24 14:49:26 UTC
xinghua: Please verify that this occurs with the latest Mesa on Ubuntu 17.10 by installing the padoka PPA.
Comment 4 vadym 2018-01-25 08:45:04 UTC
I am able to reproduce this on Ubuntu 17.10. As a result my laptop completely hangs, so I'm not able to collect any logs at all.

My setup info:
OS: Ubuntu 17.10 64-bit
CPU: Intel® Core™ i7-7500U CPU @ 2.70GHz × 4
GPU: Intel® HD Graphics 620 (Kaby Lake GT2)
mesa: OpenGL ES 3.2 Mesa 17.2.4
kernel: 4.13.0-31-generic
Comment 5 Andriy Khulap 2018-01-25 13:25:19 UTC
Upgraded Ubuntu 16.04 LTS to:

X.Org X Server 1.19.5
Release Date: 2017-10-12
X Protocol Version 11, Revision 0

Kernel 4.13.0-31-generic

And now I am able to reproduce this issue. The system hangs on both the test program and the WebGL test page.

Bisected to:
eb1497e968bd4a0edc1606e8a6f708fab3248828 is the first bad commit
commit eb1497e968bd4a0edc1606e8a6f708fab3248828
Author: Kenneth Graunke <kenneth@whitecape.org>
Date:   Fri Jul 21 12:29:30 2017 -0700

    i965/bufmgr: Allocate BO pages outside of the kernel's locking.
    
    Suggested by Chris Wilson.
    
    v2: Set the write domain to 0 (suggested by Chris).
    
    Reviewed-by: Matt Turner <mattst88@gmail.com>

Reverting this commit (git revert eb1497e968bd) from the latest mesa master 18.1.0-devel (57b0ccd178bc) solved this hang.

It looks like there is a problem in the kernel driver's handling of that ioctl.
Last working kernel in Ubuntu 16.04 LTS was 4.13.0-26-generic, hangs with 4.13.0-31-generic.
Comment 6 Mark Janes 2018-01-25 16:44:00 UTC
I think we need to see if this bug reproduces on the latest upstream kernel.  The Mesa patch could have a dependency on a kernel patch that is missing from the ubuntu kernel, or this may be a bug in the ubuntu kernel.
Comment 7 Andriy Khulap 2018-01-26 11:41:01 UTC
Tested upstream kernels (as described in https://wiki.ubuntu.com/KernelTeam/GitKernelBuild):

- 4.15.0-rc9 (git latest 993ca2068b04)
- 4.14.0 (git v4.14 bebc6082da0a)

And the hang is also present. So it is not Ubuntu-specific.
Comment 8 Tapani Pälli 2018-01-29 10:03:08 UTC
Ken, any thoughts if some particular kernel change is missing?
Comment 9 Kenneth Graunke 2018-02-01 06:23:16 UTC
This is pretty surprising...it's a whole system hang?  Or a GPU hang?

I suppose we may be allocating pages earlier, so maybe we're running out of memory for something critical later, rather than running out of memory for the big 3D texture...

But, it still seems pretty fishy.  Doesn't seem like this should cause lockups.

Chris, do you have any ideas?
Comment 10 Andriy Khulap 2018-02-01 07:31:37 UTC
This is a whole-system hang; nothing is working, and only a hard reboot helps. So I can't get any additional info.
Tried the latest git mesa master (ef272b161e05) and drm-tip kernel (4.15.0, a2fbc8000254) with the same result - hang.
Comment 11 Andriy Khulap 2018-02-28 10:16:34 UTC
Just for information, still hangs on latest Debian testing:
- Mesa 18.1.0-devel (git-ab94875352)
- Kernel 4.15.0-1-amd64 #1 SMP Debian 4.15.4-1 (2018-02-18)
Comment 12 Mark Janes 2018-02-28 16:49:30 UTC
Andriy:

From your comments, we have the following failure pattern:

linux 4.4 / Mesa 17.3.3:  PASS
linux 4.13 / Mesa 17.2.4: FAIL

Please bisect between 4.13 and 4.4 to determine which kernel commit is breaking Ken's BO allocations.
Comment 13 Mark Janes 2018-02-28 22:02:13 UTC
Reverting eb1497e968bd4a0edc1606e8a6f708fab3248828 on master prevents the system hang reported in this bug.

We still require a kernel bisection, so we can figure out if this patch is wrong, or if there is an issue with the kernel.
Comment 14 Chris Wilson 2018-03-01 23:46:50 UTC
I forgot about this OOM-killer scenario. From the description and looking at the test, it just looks like an OOM-killer rampage. It completely exhausts all my memory, often leaving the OOM killer with little choice but to panic the system.
Comment 15 Kenneth Graunke 2018-03-02 01:04:36 UTC
I just ran this locally, and indeed got piles of oom-killer.  My system survived - a couple programs got killed - and Mesa eventually returned GL_OUT_OF_MEMORY.

I'm not sure what we can do about this, to be honest.  It sounds like the 'system hang' is that the Linux OOM killer torches something critical...which would be a general Linux problem with being out of memory.

Prior to the bisected patch, Mesa would allocate pages for the texture on first access.  Now, it allocates them on creation.  This program happens to allocate a texture and never use it.  But if you ever did use it, you'd suffer the same fate.  I find it highly unlikely that any real-world program would hit this case - if someone allocates a texture, they probably intend to use it.

Chris, is there some reason that the kernel can't just...swap those pages out?  Nothing is using them.  Perhaps we should madvise them until first use or something?  Or, should we avoid allocating huge things (above some threshold) up front?  Or...really...the OOM killer sabotaging systems seems like a core Linux problem, and not anything we can do much about...
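
A rough userspace-level sketch of the "madvise until first use" idea, assuming the standard i915 GEM madvise ioctl from libdrm's i915_drm.h; this is only an illustration of the concept, not what Mesa's bufmgr actually does:

#include <stdint.h>
#include <xf86drm.h>
#include <libdrm/i915_drm.h>

/* Create a BO and immediately mark its pages as discardable, so the kernel
 * may reclaim them under memory pressure instead of invoking the OOM killer. */
static uint32_t create_purgeable_bo(int fd, uint64_t size)
{
    struct drm_i915_gem_create create = { .size = size };
    if (drmIoctl(fd, DRM_IOCTL_I915_GEM_CREATE, &create) != 0)
        return 0;

    struct drm_i915_gem_madvise madv = {
        .handle = create.handle,
        .madv = I915_GEM_MADV_DONTNEED,
    };
    drmIoctl(fd, DRM_IOCTL_I915_GEM_MADVISE, &madv);
    return create.handle;
}

/* On first real use, mark the BO as needed again.  If 'retained' is 0 the
 * pages were purged in the meantime and the contents must be re-uploaded. */
static int mark_bo_needed(int fd, uint32_t handle)
{
    struct drm_i915_gem_madvise madv = {
        .handle = handle,
        .madv = I915_GEM_MADV_WILLNEED,
    };
    if (drmIoctl(fd, DRM_IOCTL_I915_GEM_MADVISE, &madv) != 0)
        return -1;
    return madv.retained ? 0 : 1;
}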
Comment 16 Andriy Khulap 2018-03-02 07:28:14 UTC
I've tried to bisect the upstream kernel, but first tried to narrow the range by testing major tags. I tested down to v4.11 and the system still hangs (v4.14-rc4, v4.13, v4.13-rc6, v4.13-rc2, v4.12, v4.11).
Mesa 17.3.3 from Debian buster.

Possibly I've missed that some other libraries were also updated (in comment 5).
Comment 17 Danylo 2018-08-03 13:13:24 UTC
Kernel was bisected to commit 40e62d5d6be8b4999068da31ee6aca7ca76669ee:
 drm/i915: Acquire the backing storage outside of struct_mutex in set-domain
 https://patchwork.freedesktop.org/patch/119012/

It seems that before that patch the memory wasn't immediately allocated, but I could be wrong.
Also it seems that the OOM killer doesn't know about such allocations and doesn't kill the example application until the very end.

However, why are we even able to request such a big allocation when creating a texture?
There is Const.MaxTextureMbytes; checking against it should prevent creation of such a texture.

i965 doesn't provide a custom TestProxyTexImage and uses _mesa_test_proxy_teximage, which doesn't take 'level' into account, so a texture with dimensions of 1x1x1 at level=11 easily passes the check.

Later, in intel_miptree_create_for_teximage, the dimensions of the image at level 0 are determined to be 2048x2048x2048, but at that point there is no check of the resulting image size.

The solution to this may be the creation of a custom TestProxyTexImage where the size of the image at level 0 is checked, so that the texture size always obeys the limits.
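
A minimal sketch of such a check, using a hypothetical helper rather than the actual Mesa TestProxyTexImage hook. For the failing test, a 1x1x1 image at level 11 implies a 2048x2048x2048 base level, i.e. 2048^3 * 4 bytes = 32 GiB for level 0 alone, so a byte-budget check would reject it before any allocation happens:

#include <stdbool.h>
#include <stdint.h>

/* Hypothetical level-aware size check: scale the dimensions given at 'level'
 * back up to the implied level-0 dimensions and compare the whole mip chain
 * against a byte budget (e.g. derived from Const.MaxTextureMbytes). */
static bool tex3d_fits_budget(uint32_t width, uint32_t height, uint32_t depth,
                              uint32_t level, uint32_t bytes_per_texel,
                              uint64_t max_texture_bytes)
{
    uint64_t w = (uint64_t)width  << level;
    uint64_t h = (uint64_t)height << level;
    uint64_t d = (uint64_t)depth  << level;

    /* For a 3D texture every mip is 1/8 of the previous one, so the whole
     * chain is bounded by 8/7 of the base level. */
    uint64_t base_bytes  = w * h * d * bytes_per_texel;
    uint64_t total_bytes = base_bytes + base_bytes / 7;

    return total_bytes <= max_texture_bytes;
}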

Also, I found that radeon_miptree_create_for_teximage has special checks for height and depth being 1; in that case they stay 1 at all levels.
Just an observation...

To sum up:

- One issue is that the texture size limit is enforced inconsistently; this can be fixed in Mesa.
- The second is the OOM killer being unable to cope with this type of allocation. I don't have any knowledge about this one.
Comment 18 Danylo 2018-08-07 13:04:58 UTC
(In reply to Danylo from comment #17)
> The solution to this may be the creation of a custom TestProxyTexImage where
> the size of the image at level 0 is checked, so that the texture size always
> obeys the limits.

Any thoughts whether this is a good solution?
Comment 19 Yang Gu 2018-08-17 11:18:29 UTC
Created attachment 141168 [details]
attachment-20682-0.html

Yang is OOO from Aug 10 to 19 for SIGGRAPH 2018. Please expect slow response.
Comment 20 Mark Janes 2018-09-21 23:06:16 UTC
Since this has been bisected to a kernel commit, should it be assigned to Mesa?
Comment 21 Danylo 2018-09-24 08:00:10 UTC
While there is a bisected kernel commit, this is possibly not a kernel issue; at least half of it is not in the kernel.

> However, why are we even able to request such a big allocation when creating a texture?
> There is Const.MaxTextureMbytes; checking against it should prevent creation of such a texture.

The main issue, as I described in the previous comment, is Mesa trying to allocate more memory than it itself allows.

On the kernel side the only issue is that the application is killed last, even though it holds the most memory.
Comment 22 xinghua 2018-12-17 08:47:55 UTC
Hi all, I think this issue is very serious; the system hangs and requires a hard shutdown to recover. Could you investigate it again? Thank you.

As described in this bug, if glTexImage3D is called to specify the texture images from the highest mip level down to level 0, the system hangs:
for(int i = 0; i < maxLevels; i++)
{
    int size = 1 << i;
    int level = maxLevels - i - 1;
    glTexImage3D(GL_TEXTURE_3D, level, GL_RGBA, size, 1, 1, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
}

But if glTexImage3D is called to specify the texture images from level 0 up to the highest level, the system does not hang:
for(int i = 0; i < maxLevels; i++)
{
    int size = 1 << (maxLevels - i - 1);
    int level = i;
    glTexImage3D(GL_TEXTURE_3D, level, GL_RGBA, size, 1, 1, 0, GL_RGBA, GL_UNSIGNED_BYTE, NULL);
}
I do not know why these two allocation orders behave so differently.
Comment 23 Lionel Landwerlin 2018-12-17 10:23:19 UTC
Reassigning to i915: on my laptop (with 16GB of RAM) it tries to allocate a BO of ~12.25GB and completely locks up the system.
Comment 24 Denis 2018-12-17 10:59:52 UTC
But according to Danylo, there are two issues here, one on the Mesa side and one on the kernel side.
So possibly we need one ticket for i965 and one (the current one) for drm/intel?
Just in case the issue gets fixed in only one place...
Comment 25 Lionel Landwerlin 2018-12-17 12:19:49 UTC
Created attachment 142833 [details] [review]
i965: limit texture total size

A workaround for i965, but we should really not have the i915 allocation lock up the system.
Comment 26 Mark Janes 2018-12-17 16:15:58 UTC
Is the correct resolution to track graphics memory in a way that allows the OOM killer to target the process that is locking up the system?

Lionel's workaround will handle this particular case, but there are many other ways to produce the same effect.
Comment 27 Lionel Landwerlin 2018-12-19 13:17:10 UTC
(In reply to Mark Janes from comment #26)
> Is the correct resolution to track graphics memory in a way that allows the
> OOM killer to target the process that is locking up the system?
> 
> Lionel's workaround will handle this particular case, but there are many
> other ways to produce the same effect.

Indeed.

(In reply to Denis from comment #24)
> But according to Danylo, there are two issues here, one on the Mesa side and
> one on the kernel side.
> So possibly we need one ticket for i965 and one (the current one) for drm/intel?
> Just in case the issue gets fixed in only one place...

I don't think a simple ioctl to the i915 driver with a particular size should lock up the system.
Any userspace program can do that; this isn't related to Mesa.
Comment 28 Eero Tamminen 2019-02-12 11:36:07 UTC
(In reply to Mark Janes from comment #26)
> Is the correct resolution to track graphics memory in a way that allows the
> OOM killer to target the process that is locking up the system?

GFX memory usage tracking isn't just an Intel- or 3D-specific issue; it's an issue for the whole kernel DRI subsystem.

There's even a CVE about it: https://nvd.nist.gov/vuln/detail/CVE-2013-7445


> Lionel's workaround will handle this particular case, but there are many
> other ways to produce the same effect.

See also bug 106106 and bug 106136.
Comment 29 Yang Gu 2019-02-12 11:36:28 UTC
Created attachment 143364 [details]
attachment-10542-0.html

Yang is off from Feb 3 to 17 for Chinese New Year holidays and extra vacations. Please expect slow email response.
Comment 30 xinghua 2019-04-23 09:51:53 UTC
Has the crash issue been resolved recently? I could not reproduce this issue on Ubuntu Disco Dingo.
On my machine, it either ran correctly or reported an out-of-memory message. I think it is normal to report out-of-memory when the system is in a low-memory situation or cannot allocate ~12G of memory. Do you agree?
Comment 31 Denis 2019-04-24 13:03:36 UTC
hi. I also re-checked this issue on Manjaro OS (with kernel 5.0.5) and can say that it doesn't lead to the hang anymore.

The test mostly passes, except for 3 subtests which fail with an OOM error.

The compiled test also no longer freezes the system. It allocates all my memory (15.9 GB) and then terminates normally.

@Lionel - is this expected, or do we need to bisect and find what might have fixed this behaviour? What do you think?
Comment 32 Lakshmi 2019-07-15 08:17:50 UTC
(In reply to Denis from comment #31)
> hi. I also re-checked this issue on Manjaro OS (with kernel 5.0.5) and can
> say that it doesn't lead to the hang anymore.
> 
> The test mostly passes, except for 3 subtests which fail with an OOM error.
> 
> The compiled test also no longer freezes the system. It allocates all my
> memory (15.9 GB) and then terminates normally.
> 
> @Lionel - is this expected, or do we need to bisect and find what might have
> fixed this behaviour? What do you think?

@Lionel, can you answer the questions above?
Comment 33 Lionel Landwerlin 2019-07-15 18:02:02 UTC
(In reply to Lakshmi from comment #32)
> (In reply to Denis from comment #31)
> > hi. I also re-checked this issue on Manjaro OS (with kernel 5.0.5) and can
> > say that it doesn't lead to the hang anymore.
> > 
> > The test mostly passes, except for 3 subtests which fail with an OOM error.
> > 
> > The compiled test also no longer freezes the system. It allocates all my
> > memory (15.9 GB) and then terminates normally.
> > 
> > @Lionel - is this expected, or do we need to bisect and find what might have
> > fixed this behaviour? What do you think?
> 
> @Lionel, can you answer the questions above?

I don't think it's bisectable (or at least not easily) because it has been the behavior for quite some time.
I haven't been around i915 for that long though :)
Comment 34 Denis 2019-07-16 10:18:07 UTC
that's true 8-/ I tried to bisect on the drm-tip and drm-intel-testing repos and ended up with compile errors 8-/

I suggest closing this issue as fixed somewhere between kernel 4.19 and 5.0.21.
Comment 35 Lakshmi 2019-07-17 06:45:11 UTC
Closing this issue as Fixed as per the suggestion.
