93301 – ns2_linux32: radeon VM fault on Hawaii (+mmap errors)

Bug 93301 - ns2_linux32: radeon VM fault on Hawaii (+mmap errors)

Summary: ns2_linux32: radeon VM fault on Hawaii (+mmap errors)

Status:	RESOLVED FIXED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Gallium/radeonsi (show other bugs)
Version:	git
Hardware:	Other Linux (All)

Importance:	medium normal
Assignee:	Default DRI bug account
QA Contact:	Default DRI bug account

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2015-12-09 04:10 UTC by Thomas DEBESSE
Modified:	2017-10-09 15:34 UTC (History)
CC List:	3 users (show)

See Also:
i915 platform:
i915 features:

Attachments
syslog (Linux 4.3, mesa git, radeon hawaii r9 390X) (13.49 MB, text/plain) 2015-12-09 04:10 UTC, Thomas DEBESSE	Details
ns2 stdout (nohyperz) (5.86 KB, text/plain) 2015-12-09 09:23 UTC, Thomas DEBESSE	Details
ns2 stdout (nodcc) (77.97 KB, text/plain) 2015-12-09 09:44 UTC, Thomas DEBESSE	Details
ns2 stdout (default) (102.44 KB, text/plain) 2015-12-09 09:52 UTC, Thomas DEBESSE	Details
ns2 stdout + gdb backtrace (176.96 KB, text/plain) 2015-12-09 10:18 UTC, Thomas DEBESSE	Details
ns2 stdout + gdb full backtrace + libdrm-radeon1-dbg symbols (83.76 KB, text/plain) 2015-12-09 10:37 UTC, Thomas DEBESSE	Details
0001-XXX-winsys-radeon-limit-buffer-cache-size.patch (1.18 KB, patch) 2015-12-09 20:28 UTC, Nicolai Hähnle	Details \| Splinter Review
ns2 stdout + gdb full backtrace + libdrm-radeon1-dbg symbols / patched mesa + R600_DEBUG=info (97.74 KB, text/plain) 2015-12-10 01:47 UTC, Thomas DEBESSE	Details
GTT and VRAM usage after 30 min play (1.02 MB, image/png) 2017-01-14 19:54 UTC, famo	Details
GTT and VRAM on crash (1.35 MB, image/png) 2017-01-14 20:25 UTC, famo	Details
Show Obsolete (1) View All

Description Thomas DEBESSE 2015-12-09 04:10:26 UTC

Created attachment 120427 [details]
syslog (Linux 4.3, mesa git, radeon hawaii r9 390X)

Hi I (re) tried Natural Selection 2 yesterday, and I it crashed (as expected), and this time the syslog said:

```
Dec  8 07:20:25 gollum kernel: [16588.241364] radeon 0000:01:00.0: GPU fault detected: 146 0x0e65e014
Dec  8 07:20:25 gollum kernel: [16588.241368] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00007673
Dec  8 07:20:25 gollum kernel: [16588.241369] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x050E0014
Dec  8 07:20:25 gollum kernel: [16588.241370] VM fault (0x04, vmid 2) at page 30323, write from 'CB1' (0x43423100) (224)
---8<--- 120 000+ lines ---------------------
Dec  8 07:22:58 gollum kernel: [16741.758495] radeon 0000:01:00.0: GPU fault detected: 146 0x06a59014
Dec  8 07:22:58 gollum kernel: [16741.758506] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00007673
Dec  8 07:22:58 gollum kernel: [16741.758512] radeon 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x05150014
Dec  8 07:22:58 gollum kernel: [16741.758518] VM fault (0x04, vmid 2) at page 30323, write from 'CB6' (0x43423600) (336)
Dec  8 07:23:03 gollum kernel: [16746.694729] ns2_linux32[16192]: segfault at 1 ip 00000000f3bf8d12 sp 00000000f0b9adf8 error 4 in libc-2.21.so[f3acc000+1b4000]
```

Well, I'm not expecting to see this game working one day, I own this game since 2013-12-15 (it's almost two years) and it never worked, so these guys proved their product is a joke and I don't expect nothing from them. But perhaps this bug can help you to track things on your side, that's why I report it.

When the error happened, NS2 was loading assets since 20min to load a tutorial map.

I don't have more logs currently but if you have some ideas of things I can record, I can try to reproduce it. I agree the current log is a bit quiet.

I've already seen radeon errors like that before in a dpm issue (see #93288 ) but the current bug is unrelated to dpm (dpm was inactive in the current case).

I use a Radeon R9 390X with Mesa git on Linux 4.3 here.

Comment 1 Nicolai Hähnle 2015-12-09 05:52:01 UTC

Hi Thomas! The VM faults are in backend writes. Can you try running with R600_DEBUG=nohyperz and/or nodcc?

Comment 2 Michel Dänzer 2015-12-09 06:22:35 UTC

Note that the game crash probably isn't directly related to the GPUVM faults. We'd need to see backtrace of the crash (preferably from gdb) to say if the crash is due to a bug in the game or Mesa.

Comment 3 Thomas DEBESSE 2015-12-09 09:21:47 UTC

(In reply to Michel Dänzer from comment #2)
> Note that the game crash probably isn't directly related to the GPUVM
> faults.

Perhaps, but they were spamed right before the crash and stopped with it.

> We'd need to see backtrace of the crash (preferably from gdb) to say
> if the crash is due to a bug in the game or Mesa.

I will try that.

(In reply to Nicolai Hähnle from comment #1)
> Hi Thomas! The VM faults are in backend writes. Can you try running with
> R600_DEBUG=nohyperz and/or nodcc?

Right now I started it with R600_DEBUG=nohyperz and I get the output I’ll join.

On stdout side I got a 'radeon: mmap failed, errno: 12' error and on the syslog side I got a standard segfault message. There was no other message in syslog.

Comment 4 Thomas DEBESSE 2015-12-09 09:23:10 UTC

Created attachment 120429 [details]
ns2 stdout (nohyperz)

Comment 5 Thomas DEBESSE 2015-12-09 09:39:53 UTC

Good to know, using R600_DEBUG=nodcc the game loaded fine. There was many 'radeon: mmap failed, errno: 1' lines printed but the games does not crashed on graphical rendering (it crashed while playing a weapon sound). I was able to load a map and move in it, so that nodcc option workarounds the graphical rendering issue.

Comment 6 Thomas DEBESSE 2015-12-09 09:44:36 UTC

Created attachment 120430 [details]
ns2 stdout (nodcc)

Comment 7 Thomas DEBESSE 2015-12-09 09:52:21 UTC

Created attachment 120431 [details]
ns2 stdout (default)

Comment 8 Thomas DEBESSE 2015-12-09 10:18:30 UTC

Created attachment 120432 [details]
ns2 stdout + gdb backtrace

Comment 9 Thomas DEBESSE 2015-12-09 10:37:39 UTC

Created attachment 120433 [details]
ns2 stdout + gdb full backtrace + libdrm-radeon1-dbg symbols

Comment 10 Thomas DEBESSE 2015-12-09 10:39:35 UTC

So, I added a gdb backtrace, I guess the interesting part is:

---8<-------------
Thread 31 (Thread 0xe04e8b40 (LWP 26311)):
#0  0xf7fd9be5 in __kernel_vsyscall ()
No symbol table info available.
#1  0xf478025c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/i386-linux-gnu/libpthread.so.0
No symbol table info available.
#2  0xe2e9cabc in radeon_drm_cs_emit_ioctl (param=0x8386300) at ../../../../../../../include/c11/threads_posix.h:159
        ws = 0x8386300
        cs = <optimized out>
        i = <optimized out>
#3  0xe2e9c117 in impl_thrd_routine (p=0x856a708) at ../../../../../../../include/c11/threads_posix.h:87
        pack = {func = 0xe2e9ca60 <radeon_drm_cs_emit_ioctl>, arg = 0x8386300}
#4  0xf477b1aa in start_thread () from /lib/i386-linux-gnu/libpthread.so.0
No symbol table info available.
#5  0xf44b5fde in clone () from /lib/i386-linux-gnu/libc.so.6
------------->8---

Comment 11 Nicolai Hähnle 2015-12-09 12:39:12 UTC

Thanks for the followup. The last crash you reported

Comment 12 Nicolai Hähnle 2015-12-09 12:51:19 UTC

Thank you for the followup.

Concerning your last backtrace: The crash is in a different thread than the one you mentioned, but the symbols in the backtrace suggest that it happens while updating a buffer, so the crash is likely a consequence of the failed mmap and the game not checking whether the map succeeded.

I'm noticing a lot of other error messages in stdout, though, such as the "Failed to launch child process" and something about allocation failures that is unrelated to Mesa. Perhaps the game's virtual address space is simply getting full? It does call itself a 32 bit program.

I'm a bit surprised in hindsight that nodcc had an effect. Now that I think about it, Hawaii doesn't actually have DCC, does it?

Comment 13 Alex Deucher 2015-12-09 14:18:44 UTC

(In reply to Nicolai Hähnle from comment #12)
> I'm a bit surprised in hindsight that nodcc had an effect. Now that I think
> about it, Hawaii doesn't actually have DCC, does it?

Right.  Hawaii (CI) does not have DCC, only VI parts do.

Comment 14 Marek Olšák 2015-12-09 14:35:13 UTC

No, Hawaii doesn't have DCC.

The issue can indeed be that there is not enough address space. Mesa doesn't unmap most buffers, because mmap is a very costly operation. Such buffers are only unmapped when destroyed. This is obviously a problem on 32-bit. Possible solutions:

1) Always unmap buffers at appropriate places. This can have a significant performance cost in CPU-bound apps.

2) Clear pb_cache on mmap failure and try mmap again.

(pb_cache typically contains staging buffers, which are always mapped. Therefore, pb_cache completely hides allocation, GPU mmap, and CPU mmap overhead, but it also contributes to address space usage.)

Comment 15 Nicolai Hähnle 2015-12-09 20:28:49 UTC

Created attachment 120440 [details] [review]
0001-XXX-winsys-radeon-limit-buffer-cache-size.patch

Following the hypothesis that your game is running out of address space, here is a quick patch for Mesa that limits the buffer cache size to 4MB as a simple test.

Try this (feel free to play around with the number in the patch), plus reducing any graphics quality options the game may have to the minimum. Don't be surprised if performance becomes worse.

Please also set the environment variable R600_DEBUG=info and provide the output, as this allows us to see the default buffer cache size.

Comment 16 Thomas DEBESSE 2015-12-09 21:05:19 UTC

OK. Thanks, I will try all these things. I just have one question, since this software is running in 32 bit mode, does it mean I have to recompile mesa in 32bit too ?

Comment 17 Marek Olšák 2015-12-09 21:56:00 UTC

(In reply to Thomas DEBESSE from comment #16)
> OK. Thanks, I will try all these things. I just have one question, since
> this software is running in 32 bit mode, does it mean I have to recompile
> mesa in 32bit too ?

You only have to recompile Mesa 32-bit, but not 64-bit.

Comment 18 Thomas DEBESSE 2015-12-10 01:46:09 UTC

OK, I patched and recompiled mesa, but it changes nothing (see incoming log).

Setting R600_DEBUG=info I get:

---8<-------------
radeon_drm_winsys_create: create cache of 4194304 bytes
pci_id = 0x67b0
family = 57
chip_class = 9
gart_size = 2043 MB
vram_size = 8192 MB
max_sclk = 1080
max_compute_units = 44
max_se = 4
max_sh_per_se = 1
drm = 2.43.0
has_uvd = 1
vce_fw_version = 671220224
r600_num_backends = 16
r600_clock_crystal_freq = 27000
r600_tiling_config = 0x1023
r600_num_tile_pipes = 16
r600_max_pipes = 11
r600_virtual_address = 1
r600_has_dma = 1
r600_backend_map = 0
r600_backend_map_valid = 1
si_tile_mode_array_valid = 1
cik_macrotile_mode_array_valid = 1
Num displays: 1
------------->8---

Comment 19 Thomas DEBESSE 2015-12-10 01:47:16 UTC

Created attachment 120443 [details]
ns2 stdout + gdb full backtrace + libdrm-radeon1-dbg symbols / patched mesa + R600_DEBUG=info

Comment 20 Marek Olšák 2015-12-10 11:35:30 UTC

(In reply to Thomas DEBESSE from comment #19)
> Created attachment 120443 [details]
> ns2 stdout + gdb full backtrace + libdrm-radeon1-dbg symbols / patched mesa
> + R600_DEBUG=info

This looks like the mmap error has been fixed.

Comment 21 Nicolai Hähnle 2015-12-10 13:01:12 UTC

Indeed. You'll also find that thread 1 commits suicide because TCMalloc cannot find more memory. So it's definitely still an out of memory condition, but without more evidence to the contrary I'm inclined to say it's probably not the OpenGL driver's fault.

Comment 22 Marek Olšák 2015-12-10 14:41:35 UTC

Yeah, it could be a memory leak of some kind. I recommend running the game with this environment variable:
GALLIUM_HUD=VRAM-usage,GTT-usage
You should see if it's leaking GPU buffers. You can also try running it with valgrind, but exit the game before it crashes.

Comment 23 famo 2017-01-14 19:54:20 UTC

Created attachment 128951 [details]
GTT and VRAM usage after 30 min play

Here is a screenshot of GTT and VRAM usage after 30min of play.

From these graphs alone I couldn't make out any memory leaks.

However the game crashed soon afterwards on an new map.


System Info:
Card: Advanced Micro Devices [AMD/ATI] Bonaire XTX [Radeon R7 260X]

glxinfo | grep OpenGL
OpenGL vendor string: X.Org
OpenGL renderer string: Gallium 0.4 on AMD BONAIRE (DRM 2.46.0 / 4.8.6-1-CHAKRA, LLVM 3.9.1)
OpenGL core profile version string: 4.3 (Core Profile) Mesa 13.0.2
OpenGL core profile shading language version string: 4.30
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 3.0 Mesa 13.0.2
OpenGL shading language version string: 1.30
OpenGL context flags: (none)

Comment 24 famo 2017-01-14 20:25:25 UTC

Created attachment 128953 [details]
GTT and VRAM on crash

Here is another screenshot, taken right on a crash.

Comment 25 Kamil Páral 2017-01-14 22:31:16 UTC

Please note that more people report crashing on radeonsi after some time of playing:
http://forums.unknownworlds.com/discussion/comment/2330666/#Comment_2330666
This is with "Failed to allocate" errors.

For me personally, NS2 didn't even start on older Mesa (mmap failed), now it starts, but as people report, seems to crash after some time. I personally never played that long (because loading the shaders takes an eternity).

I'd guess this is simply the game running out of 32bit space, and Mesa drivers being more memory hungry than proprietary drivers. The devs had OOM problems even with nvidia drivers in the past and had to cut down on memory usage. They said they can't even add more skins because they're that close to the RAM limit. They are working on a 64bit version of the game, which should hopefully fix all this.

Comment 26 famo 2017-01-14 23:00:37 UTC

(In reply to Kamil Páral from comment #25)
> Please note that more people report crashing on radeonsi after some time of
> playing:
> http://forums.unknownworlds.com/discussion/comment/2330666/#Comment_2330666
> This is with "Failed to allocate" errors.
> 
I know (and if you compare the system info, you'll find a lot similarities ... ;-)


> For me personally, NS2 didn't even start on older Mesa (mmap failed), now it
> starts, but as people report, seems to crash after some time. I personally
> never played that long (because loading the shaders takes an eternity).
> 
Yeah, same here, just go it to load wite mesa 13 and llvm 3.9.1.
With regard to loading times, I guess shader-on-disk cache should help here...


> I'd guess this is simply the game running out of 32bit space, and Mesa
> drivers being more memory hungry than proprietary drivers. The devs had OOM
> problems even with nvidia drivers in the past and had to cut down on memory
> usage. They said they can't even add more skins because they're that close
> to the RAM limit. They are working on a 64bit version of the game, which
> should hopefully fix all this.

Comment 27 Thomas DEBESSE 2017-01-15 01:59:57 UTC

On my end I re-tried to play ns2_linux32 with my R9 390X two days ago running Linux 4.9 (amdgpu), Mesa 17~git and LLVM 4~svn and I was able to play a complete alien tutorial without crashes and the rendering looked OK to me. By the way I haven't seen the game running on another platform so I can miss something, but everything looked acceptable, I don't know if it's the expected rendering, but it's a nice, acceptable and honest rendering. I've never played too much this game to be able to hit the 32bit memory limit.

Comment 28 famo 2017-08-24 21:02:18 UTC

I just want to report that with new mesa and cache NS2 runs as smooth as butter, thanks to the devs!

System:    Host: cray Kernel: 4.12.4-1-CHAKRA x86_64 (64 bit) Desktop: KDE Plasma 5.10.4 Distro: Chakra
Machine:   Device: desktop Mobo: ASUSTeK model: A88XM-PLUS v: Rev X.0x UEFI: American Megatrends v: 3003 date: 03/04/2017
CPU:       Quad core AMD A10-7850K Radeon R7 12 Compute Cores 4C+8G (-MCP-) cache: 8192 KB 
           clock speeds: max: 4200 MHz 1: 4200 MHz 2: 3000 MHz 3: 3000 MHz 4: 3000 MHz
Graphics:  Card: Advanced Micro Devices [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series]
           Display Server: x11 (X.Org 1.19.3) driver: amdgpu Resolution: 2560x1440@119.88hz
           OpenGL: renderer: Gallium 0.4 on AMD FIJI (DRM 3.15.0 / 4.12.4-1-CHAKRA, LLVM 4.0.1)
           version: 4.5 Mesa 17.1.5
Audio:     Card-1 Advanced Micro Devices [AMD] FCH Azalia Controller driver: snd_hda_intel
           Card-2 Advanced Micro Devices [AMD/ATI] Fiji HDMI/DP Audio Controller driver: snd_hda_intel
           Sound: Advanced Linux Sound Architecture v: k4.12.4-1-CHAKRA
Network:   Card: Realtek RTL8111/8168/8411 PCI Express Gigabit Ethernet 
Sensors:   System Temperatures: cpu: 44.0C mobo: 34.0C gpu: 41.0
           Fan Speeds (in rpm): fan-1: 826 fan-2: 841 fan-3: 717
Info:      Processes: 213 Uptime: 11:34 Memory: 4278.6/15994.4MB Client: Shell (bash) inxi: 2.3.23

Comment 29 Kamil Páral 2017-08-25 08:36:19 UTC

Yes, I can confirm NS2 works well now. One thing to note is that when compiling the shaders, the memory usage of the game is much higher than when running from the cache (even *after* the compilation has ended), I'm not sure whether it's a game problem or driver problem of not freeing all the memory. In that case, it's still possible to hit the 32b memory limit. But when running from cache, I've seen no memory issue and rendering is fine.

Comment 30 Nicolai Hähnle 2017-10-09 15:34:29 UTC

This seems to be fixed. If the high memory use is still an issue some place, it seems best to open a new bug.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.