105509 – SIGBUS, Bus error during command buffer recording

Status:	RESOLVED NOTABUG

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/Vulkan/intel
Version:	git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium normal
Assignee:	Intel 3D Bugs Mailing List
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2018-03-14 15:22 UTC by Vyacheslav
Modified:	2018-03-17 05:26 UTC
CC List:	1 user

See Also:
i915 platform:
i915 features:

Attachments
vulkaninfo (106.53 KB, text/plain) 2018-03-14 15:22 UTC, Vyacheslav

Description Vyacheslav 2018-03-14 15:22:49 UTC

Created attachment 138110
vulkaninfo

My system is Linux archlinux 4.15.8-1-ARCH #1 SMP PREEMPT Sat Mar 10 00:00:33 UTC 2018 x86_64 GNU/Linux.
My GPU is an Intel HD Graphics 630.
I installed the latest Mesa from git (18.1.0_devel.100894.fcf267ba08).

When recording a command buffer I get this error:

Program received signal SIGBUS, Bus error.

anv_state_stream_alloc (stream=stream@entry=0x55559dbf9dd8, size=64, alignment=alignment@entry=32) at vulkan/anv_allocator.c:913

913       VG_NOACCESS_WRITE(&sb->block, stream->block);

0  in anv_state_stream_alloc of vulkan/anv_allocator.c:913
1  in anv_cmd_buffer_alloc_dynamic_state of vulkan/anv_batch_chain.c:654
2  in anv_cmd_buffer_push_constants of vulkan/anv_cmd_buffer.c:729
3  in cmd_buffer_flush_push_constants of vulkan/genX_cmd_buffer.c:2420
4  in gen9_cmd_buffer_flush_state of vulkan/genX_cmd_buffer.c:2571
5  in gen9_CmdDrawIndexed of vulkan/genX_cmd_buffer.c:2709
6  in ?? of /usr/lib/libVkLayer_core_validation.so
7  in ?? of /usr/lib/libVkLayer_parameter_validation.so
8  in ?? of /usr/lib/libVkLayer_threading.so
9  in vkcmd_create_secondary_command_buffer of vkcmd.c:207
10 in vkcmd_create_secondary_command_buffer_for_inst of vkcmd.c:88
11 in scn_load_scene of scene.c:407
12 in create_scene of main.c:903
13 in main of main.c:583

I send push constants containing a matrix and a color (80 bytes) for both stages. The limit is 128 bytes, so that should be fine. I get no validation output from the standard layers.
The function that creates the secondary command buffer: https://pastebin.com/vN2WjA1W

Comment 1 Vyacheslav 2018-03-14 15:31:10 UTC

I compressed the vktrace file and put it on Dropbox because it is too big to attach (40MB): https://www.dropbox.com/s/ko5cqlrb5baj2wy/trace.7z?dl=0

Comment 2 Vyacheslav 2018-03-15 12:03:38 UTC

I recompiled Mesa with --enable-debug and now I get a failed assert, which is more useful.

0  in raise of /usr/lib/libc.so.6
1  in abort of /usr/lib/libc.so.6
2  in __assert_fail_base of /usr/lib/libc.so.6
3  in __assert_fail of /usr/lib/libc.so.6
4  in anv_block_pool_expand_range of vulkan/anv_allocator.c:327
5  in anv_block_pool_grow of vulkan/anv_allocator.c:521
6  in anv_block_pool_alloc_new of vulkan/anv_allocator.c:564
7  in anv_block_pool_alloc of vulkan/anv_allocator.c:582
8  in anv_fixed_size_state_pool_alloc_new of vulkan/anv_allocator.c:655
9  in anv_state_pool_alloc_no_vg of vulkan/anv_allocator.c:772
10 in anv_state_stream_alloc of vulkan/anv_allocator.c:909
11 in anv_cmd_buffer_alloc_dynamic_state of vulkan/anv_batch_chain.c:654
12 in anv_cmd_buffer_push_constants of vulkan/anv_cmd_buffer.c:729
13 in cmd_buffer_flush_push_constants of vulkan/genX_cmd_buffer.c:2420
14 in gen9_cmd_buffer_flush_state of vulkan/genX_cmd_buffer.c:2571
15 in gen9_CmdDrawIndexed of vulkan/genX_cmd_buffer.c:2709

This assert fails:
assert(size - center_bo_offset <=
          BLOCK_POOL_MEMFD_SIZE - BLOCK_POOL_MEMFD_CENTER);
It fails because size=1073741824 (1 GiB, i.e. 1 << 30) runs into the block pool limit; block pools are restricted to 1GB:
/* Block pools are backed by a fixed-size 1GB memfd */
#define BLOCK_POOL_MEMFD_SIZE (1ul << 30)

I still don't know how to fix it. Is this valid behavior? Why isn't it caught by the validation layers? Should I use a new command pool for the next batch of objects?

Comment 3 Vyacheslav 2018-03-15 17:11:00 UTC

So the root cause is in anv_allocator. Each command buffer takes up a lot of mapped memory: 2320+4096+16384 bytes for anv_cmd_buffer, surface state, and dynamic state. The allocator then rounds this up to a power of 2, which is 32KB, so one command buffer takes up 32KB. The maximum amount of mapped memory per device is 1GB, which is enough for ~32000 command buffers. After that, mmap fails.

So the default allocator was designed for a small number of command buffers.
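
The estimate above can be checked with a few lines of C. This is only a sketch of the arithmetic in this comment, not the driver's allocator: the helper names are hypothetical, and it assumes every command buffer costs the same rounded-up amount out of a single 1GB budget.

```c
#include <assert.h>
#include <stdint.h>

/* 1 GiB budget, matching BLOCK_POOL_MEMFD_SIZE quoted in comment 2. */
#define POOL_LIMIT_BYTES (UINT64_C(1) << 30)

/* Round x up to the next power of two (x must be non-zero). */
static inline uint64_t round_up_pow2(uint64_t x)
{
    assert(x != 0);
    uint64_t p = 1;
    while (p < x)
        p <<= 1;
    return p;
}

/* Hypothetical helper: how many allocations of `bytes` fit in the budget
 * when each one is rounded up to a power of two. */
static inline uint64_t max_command_buffers(uint64_t bytes)
{
    return POOL_LIMIT_BYTES / round_up_pow2(bytes);
}
```

With the sizes from this comment, 2320+4096+16384 = 22800 bytes rounds up to 32768 bytes (32KB), and max_command_buffers(22800) gives 32768, which matches the ~32000 figure.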

Comment 4 Vyacheslav 2018-03-15 20:37:50 UTC

The workaround is to use one command buffer per frame (or per thread). Just record everything inside the primary command buffer. It is worse for performance but it is better for mmap and it solved the problem.

Comment 5 Jason Ekstrand 2018-03-17 05:26:15 UTC

(In reply to Vyacheslav from comment #4)
> Workaround is to use one command buffer per frame (or per thread). Just
> record everything inside primary command buffer. It is worse for performance
> but it is better for mmap and it solved the problem.

If you are using enough secondary command buffers to blow past the 1GB limit, then you are probably going to get garbage performance anyway.  We have to do a full GPU on every vkCmdExecuteCommands so, while performance is probably fine if secondary command buffers are used sparingly, over-use of them can cause significant perf problems.

I think there is still something of a driver bug here, in that we really should be throwing VK_ERROR_OUT_OF_DEVICE_MEMORY instead of asserting or getting you into a situation where you can get SIGBUS.
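
As a rough illustration of that idea, the range check from the failing assert could report an error to the caller instead of aborting. This is a sketch, not the actual Mesa change: the function and enum names are hypothetical stand-ins (the enum mirrors the VkResult values VK_SUCCESS and VK_ERROR_OUT_OF_DEVICE_MEMORY), and the value used for BLOCK_POOL_MEMFD_CENTER is an assumption, since it is not quoted in this report.

```c
#include <assert.h>
#include <stdint.h>

/* Quoted in comment 2: block pools are backed by a fixed-size 1GB memfd. */
#define BLOCK_POOL_MEMFD_SIZE   (UINT64_C(1) << 30)
/* Assumption for illustration only: center placed halfway into the memfd. */
#define BLOCK_POOL_MEMFD_CENTER (UINT64_C(1) << 29)

static_assert(BLOCK_POOL_MEMFD_CENTER < BLOCK_POOL_MEMFD_SIZE,
              "center must lie inside the memfd");

/* Hypothetical stand-ins for the corresponding VkResult values. */
enum sketch_result {
    SKETCH_SUCCESS = 0,
    SKETCH_ERROR_OUT_OF_DEVICE_MEMORY = -2,
};

/* Same condition as the assert in anv_block_pool_expand_range, but the
 * failure is returned so the caller can propagate it. */
static inline enum sketch_result
sketch_check_pool_range(uint64_t size, uint64_t center_bo_offset)
{
    /* Reject inconsistent input rather than let the subtraction wrap. */
    if (center_bo_offset > size)
        return SKETCH_ERROR_OUT_OF_DEVICE_MEMORY;
    if (size - center_bo_offset >
        BLOCK_POOL_MEMFD_SIZE - BLOCK_POOL_MEMFD_CENTER)
        return SKETCH_ERROR_OUT_OF_DEVICE_MEMORY;
    return SKETCH_SUCCESS;
}
```

Each caller up the allocation chain would then need to pass the failure along until it reaches the Vulkan entry point, which is where most of the real work would be.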
