Bug 109007

Summary: radeonsi cache format changed, causes mesa crash on startup
Product: Mesa Reporter: Daniel Drake <dan>
Component: Drivers/Gallium/radeonsiAssignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact: Default DRI bug account <dri-devel>
Severity: normal    
Priority: medium CC: maraeo
Version: git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:

Description Daniel Drake 2018-12-11 07:48:30 UTC
Having upgraded from Mesa-17.3 to Mesa-18.1 in Endless OS, many users on AMD-based platforms are now reporting that the system fails to boot into the UI. I've reproduced and confirm that Xorg is crashing very early on.

Thread 4 "si_shader:0" received signal SIGSEGV, Segmentation fault.
__memcpy_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:1533
Backtrace:
#0  __memcpy_ssse3_back ()
    at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:1533
#1  0x00007fffeeba2038 in memcpy (__len=3221880836, __src=0x7fffe4000e70, 
    __dest=<optimized out>) at /usr/include/x86_64-linux-gnu/bits/string3.h:53
#2  read_data (size=3221880836, data=<optimized out>, ptr=0x7fffe4000e70)
    at ../../../../../src/gallium/drivers/radeonsi/si_state_shaders.c:95
#3  read_chunk (ptr=0x7fffe4000e70, ptr@entry=0x7fffe4000e6c, 
    data=data@entry=0x7fffe4000998, size=size@entry=0x7fffe4000980)
    at ../../../../../src/gallium/drivers/radeonsi/si_state_shaders.c:121
#4  0x00007fffeeba21b3 in si_load_shader_binary (
    shader=shader@entry=0x7fffe40008c0, binary=binary@entry=0x7fffe4000e00)
    at ../../../../../src/gallium/drivers/radeonsi/si_state_shaders.c:187
#5  0x00007fffeeba4810 in si_shader_cache_load_shader (shader=0x7fffe40008c0, 
    ir_binary=0x7fffe4000a50, sscreen=0x555555a393a0)
    at ../../../../../src/gallium/drivers/radeonsi/si_state_shaders.c:275
#6  si_init_shader_selector_async (job=job@entry=0x555555b8dfa0, 
    thread_index=thread_index@entry=0)
    at ../../../../../src/gallium/drivers/radeonsi/si_state_shaders.c:1875
#7  0x00007fffee747a55 in util_queue_thread_func (
    input=input@entry=0x555555a39fb0) at ../../../src/util/u_queue.c:271
#8  0x00007fffee7476c7 in impl_thrd_routine (p=<optimized out>)
    at ../../../include/c11/threads_posix.h:87
#9  0x00007ffff574d494 in start_thread (arg=0x7fffebe06700)

The problem here is that the on-disk radeonsi cache format changed without consideration for this in the code. The affected codepath is si_load_shader_binary() which does:

	uint32_t size = *ptr++;
	uint32_t crc32 = *ptr++;
        [...]
	ptr = read_data(ptr, &shader->config, sizeof(shader->config));
	ptr = read_data(ptr, &shader->info, sizeof(shader->info));
	ptr = read_chunk(ptr, (void**)&shader->binary.code,
			 &shader->binary.code_size);

So, the blob format is: 4 bytes size, 4 bytes CRC, shader config, shader info, code.

In mesa-17.3 the si_shader_config was 48 bytes in size, but in Mesa-18.1 and current master, si_shader_config is 52 bytes in size, because the max_simd_wave field was added.

After upgrading mesa to 18.1, with shaders compiled and cached by mesa-17.3, now the above code will obviously not behave as intended. We enter into read_chunk() with the offsets slightly wrong:

	*size = *ptr++;
	assert(*data == NULL);
	if (!*size)
		return ptr;
	*data = malloc(*size);
	return read_data(ptr, *data, *size);

and when this code executes, *size has value 3221880836, for a shader that was only 884 bytes uncompressed. read_data then tries to memcpy this much data, and that causes the crash.

In addition to the lack of invalidation of existing disk caches after the on-disk format was changed, this code also seems rather suspect in that it does not verify that it is not reading beyond the end of the shader. As an attacker I could maliciously rewrite the size field read by the read_chunk() code above to be very large, fixup the CRC and recompress, and then I could cause other apps to crash in this way.
Comment 1 Daniel Drake 2018-12-11 07:54:01 UTC
The on-disk cache format change was introduced by
   radeonsi: move max_simd_waves computation into a separate function 
https://github.com/mesa3d/mesa/commit/c02c9ee550d137fbea3ed105131d621d6af5813b
Comment 2 Marek Olšák 2018-12-11 21:46:38 UTC
Your driver uses timestamps that are normally different with every build. The cache always ignores entries with different timestamps. It looks like your distro removes the timestamps, so it always gets a cache hit.

This should be fixed in Mesa 18.3.
https://cgit.freedesktop.org/mesa/mesa/commit/?id=83ea8dd99bb16e5d9bb880e64cd2047abc536b70

Marek
Comment 3 Daniel Drake 2018-12-12 03:36:57 UTC
Got it. Thanks!

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.