Bug 111004 - memcpy accessing GPU memory mappings using SSE instructions breaks in KVM
Summary: memcpy accessing GPU memory mappings using SSE instructions breaks in KVM
Status: REOPENED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Other
Version: git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: mesa-dev
QA Contact: mesa-dev
Reported: 2019-06-26 12:02 UTC by maxamar
Modified: 2019-06-27 09:05 UTC

Description maxamar 2019-06-26 12:02:55 UTC
X crashes on an AMD RX 590 in every configuration I tried except ESXi and Xen passthrough (the card works in Windows 10).
Replacing memcpy with a custom implementation partially solved the problem.
Please see this thread on the Debian bug tracker:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=931066

Before (compiled radeonsi_dri from source):
[   131.909] (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x139) [0x55f57cf882c9]
[   131.909] (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x50) [0x7fbb6e85977f]
[   131.910] (EE) 2: /lib/x86_64-linux-gnu/libc.so.6 (memcpy+0x2d7) [0x7fbb6e7263b7]
[   131.910] (EE) 3: /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so (radeon_drm_winsys_create+0xc8c8e) [0x7fbb6ced280e]
[   131.910] (EE) 4: /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so (radeon_drm_winsys_create+0xa6220) [0x7fbb6ce8ced0]
[   131.911] (EE) 5: /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so (radeon_drm_winsys_create+0x96e35) [0x7fbb6ce6e865]
[   131.911] (EE) 6: /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so (radeon_drm_winsys_create+0x97b21) [0x7fbb6ce6fe51]
[   131.911] (EE) 7: /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so (amdgpu_winsys_create+0x3f1) [0x7fbb6ce40aa1]
[   131.911] (EE) unw_get_proc_name failed: no unwind info found [-10]
[   131.911] (EE) 8: /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so (?+0x0) [0x7fbb6cc22100]
[   131.912] (EE) 9: /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so (__driDriverGetExtensions_virtio_gpu+0x9d698) [0x7fbb6cd5d288]
[   131.912] (EE) 10: /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so (__driDriverGetExtensions_virtio_gpu+0x40ea) [0x7fbb6cc2a6da]
[   131.912] (EE) 11: /usr/lib/x86_64-linux-gnu/dri/radeonsi_dri.so (__driDriverGetExtensions_virtio_gpu+0x12f8) [0x7fbb6cc24968]
[   131.912] (EE) 12: /usr/lib/x86_64-linux-gnu/libgbm.so.1 (gbm_surface_has_free_buffers+0x1b06) [0x7fbb6da271b6]
[   131.913] (EE) 13: /usr/lib/x86_64-linux-gnu/libgbm.so.1 (gbm_surface_has_free_buffers+0x1e83) [0x7fbb6da27833]
[   131.913] (EE) 14: /usr/lib/x86_64-linux-gnu/libgbm.so.1 (gbm_create_device+0x57) [0x7fbb6da235d7]
[   131.913] (EE) unw_get_proc_name failed: no unwind info found [-10]
[   131.913] (EE) 15: /usr/lib/xorg/modules/drivers/amdgpu_drv.so (?+0x0) [0x7fbb6da3d650]
[   131.913] (EE) 16: /usr/lib/xorg/Xorg (InitOutput+0x9c0) [0x55f57ce6a6a0]
[   131.913] (EE) 17: /usr/lib/xorg/Xorg (InitFonts+0x1cf) [0x55f57ce2d76f]
[   131.914] (EE) 18: /lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main+0xeb) [0x7fbb6e6a809b]
[   131.914] (EE) 19: /usr/lib/xorg/Xorg (_start+0x2a) [0x55f57ce1767a]
[   131.914] (EE)
[   131.914] (EE) Illegal instruction at address 0x7fbb6e7262f7

After (replace memcpy in mesa libs in radeonsi with custom simple impl): X boots ok but error in amdgpu dmesg (hangs):
[ 3473.934176] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=2, emitted seq=3
[ 3473.934234] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* Process information: process Xorg pid 17702 thread Xorg:cs0 pid 17703
[ 3473.934239] amdgpu 0000:01:00.0: GPU reset begin!
[ 3474.466516] amdgpu 0000:01:00.0: GPU pci config reset

Tested on both the current and the latest kernels; the oibaf drivers don't help.

Thanks.
Comment 1 Michel Dänzer 2019-06-26 13:57:57 UTC
(In reply to maxamar from comment #0)
> [   131.909] (EE) 0: /usr/lib/xorg/Xorg (OsLookupColor+0x139) [0x55f57cf882c9]
> [   131.909] (EE) 1: /lib/x86_64-linux-gnu/libpthread.so.0 (funlockfile+0x50) [0x7fbb6e85977f]
> [   131.910] (EE) 2: /lib/x86_64-linux-gnu/libc.so.6 (memcpy+0x2d7) [0x7fbb6e7263b7]
> [...]
> [   131.914] (EE) Illegal instruction at address 0x7fbb6e7262f7

This looks like a bug in /lib/x86_64-linux-gnu/libc.so.6, executing an instruction which isn't supported by your CPU.


> After (replace memcpy in mesa libs in radeonsi with custom simple impl):

How exactly did you "replace memcpy"?


> X boots ok but error in amdgpu dmesg (hangs):
> [ 3473.934176] [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout,
> signaled seq=2, emitted seq=3

If this isn't due to an issue with your memcpy replacement, it's probably a Mesa issue or maybe a kernel one, but most certainly not an xf86-video-amdgpu one.
Comment 2 maxamar 2019-06-26 16:17:17 UTC
> This looks like a bug in /lib/x86_64-linux-gnu/libc.so.6, executing an instruction which isn't supported by your CPU.
That's really KVM by-design bug - the offending instruction is the SSE movups, which, when it targets GPU address space, has to be emulated by KVM, and KVM can't emulate it.
I insist that this is Mesa's issue, as the standard memcpy would otherwise need to know whether it runs inside KVM and whether the address is in GPU space. Mesa should have its own memcpy, at least for accessing GPU memory space.

> How exactly did you "replace memcpy"?
In the source code I replaced calls to memcpy with calls to memcpy_new.
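
A minimal sketch of what such a replacement could look like. The actual memcpy_new was not posted, so the body below is an assumption, not the code that was tested. It copies one byte at a time through volatile pointers, which should keep the compiler from merging the accesses into SSE loads and stores.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for memcpy_new; the real implementation is not
 * shown in this report. Each byte is read and written through a
 * volatile pointer, so the compiler may not combine the accesses into
 * wider vector instructions such as movups. */
void *memcpy_new(void *dst, const void *src, size_t n)
{
    volatile unsigned char *d = dst;
    const volatile unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
    return dst;
}
```

A byte-wise copy like this is much slower than the glibc routine, so it would only make sense on paths that may touch GPU mappings.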

> If this isn't due to an issue with your memcpy replacement, it's probably a Mesa issue or maybe a kernel one, but most certainly not an xf86-video-amdgpu one.
These messages are generated by the AMD amdgpu kernel module.

BTW, I solved my issue by switching the VM firmware from BIOS to UEFI in KVM; however, the bare-metal version still doesn't work, which is not good. I think my glibc memcpy now chooses another code path without SSE instructions.
Comment 3 maxamar 2019-06-27 07:38:54 UTC
The issue reoccurs after changing the VM's RAM from 10 GB to 30 GB in KVM (gdm3 logs). I can't get an Ubuntu VM to boot with all SSE cpuid flags disabled.
Comment 4 Michel Dänzer 2019-06-27 07:48:14 UTC
(In reply to maxamar from comment #2)
> That's really KVM by-design bug [...]

Please take it up with KVM folks then.
Comment 5 maxamar 2019-06-27 07:57:56 UTC
As far as I've checked, at least SSE and SSE2 must be enabled for it to boot. There were other reports on the Proxmox forums stating that with a GPU it only works with a small amount of RAM.

> Please take it up with KVM folks then.
What for? That's "by-design".

Related bug
https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=202643
Fix
https://svnweb.freebsd.org/ports?view=revision&revision=489754

> Disable use of SSE instructions in Xorg's xf86SlowBcopy() function.
>
> When such instructions are used to copy data from/to mapped video
> memory, some hypervisors (e.g. KVM, Microsoft Hyper-V) can generate
> SIGILL or SIGBUS exceptions, causing Xorg to crash.

memcpy & memmove should get the same fix
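
As a rough illustration of what "the same fix" could mean for memmove, here is a sketch of an overlap-safe copy that avoids vector instructions. The name memmove_plain is hypothetical and this is not the FreeBSD patch itself; it only assumes that byte-wise volatile accesses are acceptable on the affected paths.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical non-SSE memmove sketch. Copies forward when the
 * destination starts at or below the source and backward otherwise,
 * so overlapping ranges are handled. The volatile byte accesses keep
 * the compiler from vectorizing the loops. */
void *memmove_plain(void *dst, const void *src, size_t n)
{
    volatile unsigned char *d = dst;
    const volatile unsigned char *s = src;

    assert(n == 0 || (dst != NULL && src != NULL));
    if ((uintptr_t)dst <= (uintptr_t)src) {
        for (size_t i = 0; i < n; i++)
            d[i] = s[i];
    } else {
        for (size_t i = n; i > 0; i--)
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```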
Comment 6 maxamar 2019-06-27 08:16:40 UTC
Other hypervisors should also benefit from this, as native code must be faster than their emulation (which is possibly how they solve this).
Comment 7 Michel Dänzer 2019-06-27 08:17:49 UTC
Reassigning to Mesa, but TBH I wouldn't expect anything to happen anytime soon (unless you do it yourself). Mesa definitely wants to use an optimized memcpy on bare metal, so replacing memcpy everywhere is probably out of the question, and finding all code in Mesa where this could happen might be tricky.

Also note that APIs such as OpenGL or Vulkan expose such GPU mappings to applications directly, so the approach you're suggesting would likely require fixing a lot of application/framework code as well. It would most likely be less painful if this could be solved in KVM somehow, or if you just override the default memcpy implementation (with a known-good one) on your system.
Comment 8 maxamar 2019-06-27 08:24:30 UTC
Then maybe it would be better if glibc exposed an API to mark regions of memory as non-SSE.
Comment 9 maxamar 2019-06-27 08:44:44 UTC
It seems that support for movups emulation was added in kernel 4.17: https://github.com/torvalds/linux/commit/29916968c48691c94be466a0b47cc9adcea9cb8d
Comment 10 Christian König 2019-06-27 08:47:11 UTC
Sorry but this is not a bug at all.

As Michel already noted core Vulkan as well as some OpenGL/OpenCL extensions mandate that the platform support all well aligned memory accesses to GPU local memory (VRAM).

If your platform (KVM in this case) can't do this for some reason you simply can't use that platform with this software.

In other words even if you replace memcpy/memset in Mesa with custom non SSE versions it is perfectly valid for an application to use SSE to access VRAM. And you can't change a binary application (which is actually just conforming to a standard).

The only possible workaround I can see in the driver is to not use VRAM at all for CPU mappings. That's actually rather easily doable, but would potentially cripple performance quite a bit.

I can point you to the necessary bits of code if you are interested in that.
Comment 11 maxamar 2019-06-27 09:05:55 UTC
(In reply to Christian König from comment #10)
> Sorry but this is not a bug at all.
> 
> As Michel already noted core Vulkan as well as some OpenGL/OpenCL extensions
> mandate that the platform support all well aligned memory accesses to GPU
> local memory (VRAM).
> 
> If your platform (KVM in this case) can't do this for some reason you simply
> can't use that platform with this software.
> 
> In other words even if you replace memcpy/memset in Mesa with custom non SSE
> versions it is perfectly valid for an application to use SSE to access VRAM.
> And you can't change a binary application (which is actually just conforming
> to a standard).
> 
> The only possible workaround I can see in the driver is to not use VRAM at
> all for CPU mappings. That's actually rather easily doable, but would
> potentially cripple performance quite a bit.
> 
> I can point you to the necessary bits of code if you are interested in that.

Yes, and somehow Mesa uses the "movups" instruction, which is:
MOVUPS-Move Unaligned Packed Single-Precision Floating-Point

So it is a bug.

The correct version is movaps, which copies aligned data (and has long been supported in KVM).

Yes, it is in glibc, but so what? Don't use it then.

KVM is part of Linux, so it must be supported.

Anyway, upgrading the kernel to 4.17 seems to solve the problem; this needs testing.
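
A small sketch for checking whether a kernel is at least 4.17, the version named in comment 9. The function names are hypothetical. Since KVM's instruction emulator runs in the host kernel, the assumption here is that the check is run on the KVM host, not in the guest. A version comparison is only a heuristic, because distribution kernels may carry backported patches.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/utsname.h>

/* Returns 1 if a release string such as "4.19.0-5-amd64" is at least
 * major.minor, and 0 otherwise (including when it cannot be parsed). */
int release_at_least(const char *release, int major, int minor)
{
    int maj = 0, min = 0;

    assert(release != NULL);
    if (sscanf(release, "%d.%d", &maj, &min) != 2)
        return 0;
    return maj > major || (maj == major && min >= minor);
}

/* Applies the check to the running kernel via uname(2).
 * Returns 0 if uname fails. */
int running_kernel_at_least_4_17(void)
{
    struct utsname u;

    if (uname(&u) != 0)
        return 0;
    return release_at_least(u.release, 4, 17);
}
```

Calling running_kernel_at_least_4_17() on the host gives a quick first answer before retesting the VM.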

