Bug 93686

Summary: Performance improvement: please consider hardware GPU rendering in llvmpipe
Product: Mesa              Reporter: ytrezq
Component: Mesa core       Assignee: mesa-dev
Status: CLOSED NOTABUG     QA Contact: mesa-dev
Severity: enhancement
Priority: medium           CC: ytrezq
Version: unspecified
Hardware: x86-64 (AMD64)
OS: All
Whiteboard:
i915 platform:             i915 features:

Description ytrezq 2016-01-13 00:45:22 UTC
Purpose: combine software and hardware rendering for better performance.

OK, I recognize this departs from llvmpipe's original purpose, and it might only be possible with Vulkan (probably not with OpenGL). So I recognize this feature should be optional at run-time.

The point is that some programs are GPU bound, and the SIMD nature of llvmpipe could provide non-negligible benefits, even if it's slower than an integrated GPU.
I recognize this would introduce a complex load-balancing mechanism between GPU and CPU (avoiding being CPU bound while there is room for GPU processing, and vice versa).

My GPU knowledge is strictly limited to what I learned with OpenCL and gaming, so I don't know. But if it is possible, it might be worth discussing.
Comment 1 Kenneth Graunke 2016-01-13 05:53:19 UTC
This has been proposed many times.  It's an idea that sounds nice on paper, but ends up being really complicated.  Nobody has ever come up with a good plan, as far as I know.  What should be done where?  How do we avoid having to synchronize the two frequently, killing performance?

It's not likely to happen any time soon.  It might be a viable academic research project, but it's not a sure bet.
Comment 2 Roland Scheidegger 2016-01-13 16:19:03 UTC
I'm not sure if this exact same proposal really came up already. We have seen some, though, asking if we couldn't combine llvmpipe with less capable gpus to make a driver offering more features, that is, executing the stuff the gpu can't do with llvmpipe (but no, we really can't in any meaningful way).
This proposal sounds even more ambitious in some ways, and I certainly agree we can't make it happen. With Vulkan, it may be the developer's choice, if multiple gpus are available, which one to use for what, so theoretically there might be some way there to make something like that happen, but I've no real idea (plus, unless you're looking at something like at least a 5-year-old low-end gpu vs. an 8-core current high-end cpu, there'd still be no benefits even if that could be made to work). There is one thing llvmpipe is "reasonably good" at compared to gpus, which is shader arithmetic (at least for pixel shaders, not running in parallel for vertex ones, with tons of gotchas as we don't currently even optimize empty branches away), but there's just no way to separate that.
Comment 3 ytrezq 2016-01-14 13:40:10 UTC
(In reply to Roland Scheidegger from comment #2)
> I'm not sure if this exact same proposal really came up already. We have
> seen some though asking if we couldn't combine llvmpipe with less capable
> gpus to make a driver offering more features, that is executing the stuff
> the gpu can't do with llvmpipe (but no, we really can't in any meaningful
> way).
> This proposal sounds even more ambitious in some ways, I certainly agree we
> can't make it happen. With Vulkan, it may be the developers choice if
> multiple gpus are available which one to use for what, so theoretically
> there might be some way there to make something like that happen, but I've
> no idea there really (plus, unless you're looking at something like at least
> 5 year old low-end gpu vs. 8-core current high-end cpu, there'd still be no
> benefits even if that could be made to work). There is one thing llvmpipe is
> "reasonably good" at compared to gpus, which is shader arithmetic (at least
> for pixel shaders, not running in parallel for vertex ones, with tons of
> gotchas as we don't currently even optimize empty branches away), but
> there's just no way to separate that.

I don't think it's necessary to combine a 5-year-old low-end 90 nm GPU with a 14 nm high-end CPU. For example (comparing the HD 2500 integrated graphics of my Ivy Bridge CPU with the CPU itself), glxgears (both cases in full screen) runs at 301 frames per second with the GPU and 221 frames per second with llvmpipe.
In a GPU-intensive OpenGL 3D game (the game itself uses 3% CPU), I got 11 frames per second with the GPU and 6 frames with llvmpipe.


I’m also not that sure about high-level APIs: Nvidia is able to perform completely automatic load balancing with their proprietary drivers (at the driver level) with both OpenGL and Direct3D. It works with all their GPUs.

There’s also a well-known example of perfect load balancing between several GPUs and several CPUs: OpenCL. Though I agree that rewriting some parts of llvmpipe in OpenCL might add more overhead than it removes. Plus, if it were possible, it might remove some of the manpower required to maintain low-level hardware instructions.

Of course, in the last two cases I may be wrong, because I may not really know what I’m talking about.
Comment 4 Roland Scheidegger 2016-01-14 14:30:12 UTC
(In reply to ytrezq from comment #3)

> I don’t think it’s necessary to combine 5 years old low end 90nm gpu with a
> 14nm high end cpu. For example (comparing the hd 2500 integrated graphic of
> my ivy bridge cpu and the cpu itself), glxgears (both case in full screen)
> run at 301 frame per second with the  gpu and 221 frames per second with
> llvmpipe.
Don't use glxgears as a benchmark like that. It is usually limited by lots of factors other than "gpu" performance. The rendering is way too simple, the framerate way too high - things don't just scale linearly down from glxgears...

> In a gpu intensive OpenGl 3D game (the game itself use 3% cpu), I got 11
> frames per seconds with the gpu and 6 frames with llvmpipe.
llvmpipe may look good here, but I suspect that's got more to do again with something else rather than gpu performance. Maybe there's too much cpu<->gpu synchronization going on (which typically kills your performance, but is pretty much free for llvmpipe) or whatever.

> 
> I’m also not that sure about high level api : Nvidia is able to perform
> complete automatic load balancing with their proprietary drivers (at driver
> level) with both OpenGl and Direct3D. It works with all their gpu.
I'm not sure what exactly you mean here: they do "load balancing" with multiple gpus as part of SLI, but they are just rendering one frame on one gpu and the next one on the other, that's it, and it sometimes doesn't work well either, the scaling isn't always decent. Now something like that _could_ theoretically be done with llvmpipe and some gpu I suppose, but it's not going to happen. None of the open-source drivers have even deemed it worthwhile for multiple, identical gpus yet...

> 
> There’s also a well known example of perfect load balancing between several
> gpu and several cpu : OpenCl. Though I agree re writing some parts of
> llvmpipe in OpenCl might add more overhead than it removes. Plus if it would
> be possible, it might remove some of the required manpower to maintain low
> level hardware instructions.
With OpenCL (as well as d3d12 and Vulkan), multiple gpus are presented at the api level. Thus, the application can choose which adapter to use for what, which is pretty much the only way this can work in a sane manner. There were some demos for d3d12 which did (IIRC) all the rendering on the discrete gpu and then ran some post-processing shader on the integrated graphics using exactly that. So, yes, theoretically if we supported Vulkan, something like that would be doable, but only if the app decided to make use of it.
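
For reference, a minimal sketch (my own illustration, not anything Mesa ships or that this bug defines) of how an application sees all adapters through such an API and can decide per workload which one to use, using Vulkan-style enumeration; error handling is omitted:

/* Enumerate every Vulkan physical device the loader exposes; a CPU renderer
   would show up with VK_PHYSICAL_DEVICE_TYPE_CPU, and the application itself
   decides which adapter handles which workload. */
#include <vulkan/vulkan.h>
#include <stdio.h>

int main(void)
{
    VkInstanceCreateInfo info = { .sType = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO };
    VkInstance instance;
    if (vkCreateInstance(&info, NULL, &instance) != VK_SUCCESS)
        return 1;

    uint32_t count = 0;
    vkEnumeratePhysicalDevices(instance, &count, NULL);
    if (count > 8)
        count = 8;
    VkPhysicalDevice devices[8];
    vkEnumeratePhysicalDevices(instance, &count, devices);

    for (uint32_t i = 0; i < count; i++) {
        VkPhysicalDeviceProperties props;
        vkGetPhysicalDeviceProperties(devices[i], &props);
        /* The application could, for example, render on one adapter and
           post-process on another; that policy is entirely up to the app. */
        printf("adapter %u: %s (type %d)\n", i, props.deviceName, (int)props.deviceType);
    }

    vkDestroyInstance(instance, NULL);
    return 0;
}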

In theory, some rendering loads could be split other ways: for instance, I know at least some i965 windows drivers executed vertex shader on the cpu instead of the gpu (app dependent IIRC) because that was faster. But the IGP was really weak back then, relative to the cpu it has probably increased by more than a factor of 10 in performance. Plus, more modern graphics can't be split like that easily, since the fragment shader may use the same resources as the vertex shader, meaning you have to synchronize your buffers (which tends to kill performance).
Comment 5 ytrezq 2016-01-14 16:09:17 UTC
(In reply to Roland Scheidegger from comment #4)
> (In reply to ytrezq from comment #3)
> I'm not sure what you exactly mean here: They do "load balancing" with 
> multiple gpus as part of SLI
You don’t need SLI GPUs for that; the GPUs just need to be designed by Nvidia and supported by a recent Nvidia driver (something above version 250).
Then it’s just a matter of right-clicking any application to open the context menu in Windows® (you have the choice to run it on a particular GPU or use all GPUs).

> With OpenCL (as well as d3d12, Vulkan) multiple gpus are presented at the
> api level. Thus, the application can chose which adapter to use for what,
> which is pretty much the only way how this can work in a sane manner. There
> were some demos for d3d12 which did (IIRC) all the rendering on the discrete
> gpu and then ran some post-processing shader on the integrated graphics
> using exactly that. So, yes, theoretically if we'd support Vulkan, something
> like that would be doable, but only if the app decided to make use of it.
I disagree: you can’t control individual GPUs in OpenCL. Either your program runs on all GPUs, or on all dedicated devices, or on all CPUs, or on all three of the above.

Yes, it’s true that the application can restrict itself to a set of devices, but this is a transparent choice that is completely handled by the API backend/driver.
In any case, the .cl files (and their kernels) don’t change, so there is no code refactoring of any kind.
This is much like forcing an OpenGL program to use llvmpipe instead of the hardware renderer with LIBGL_ALWAYS_SOFTWARE=1.

Choosing the device in OpenCL is just a matter of passing one of the constants CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU, CL_DEVICE_TYPE_ACCELERATOR or CL_DEVICE_TYPE_ALL (or using the default).
It might even be possible to change the device set by patching that integer in the binary.
Some language bindings for OpenCL even allow overriding the choice outside the program with environment variables.
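
As a rough illustration of that selection mechanism (a minimal sketch under my own assumptions, not code from llvmpipe or from this bug; error handling omitted):

/* List the OpenCL devices matching one device-type constant.  Swapping
   CL_DEVICE_TYPE_ALL for CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU or
   CL_DEVICE_TYPE_ACCELERATOR is the only change needed to retarget it;
   the kernels themselves stay untouched. */
#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id devices[16];
    cl_uint num_devices = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);
    printf("%u device(s) matched\n", num_devices);
    return 0;
}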

So with OpenCL I can simply say “compute that addition on all devices” and write the addition in a high-level manner, simply as “(int)((long)variable_1+variable_2)”.
I don’t think Vulkan or Direct3D can do this without a huge amount of code refactoring.

That’s why I wrote that if some parts of llvmpipe could be implemented using OpenCL, there would be no need to worry about where they are being run (performance considerations aside).
Comment 6 ytrezq 2016-01-15 13:12:56 UTC
At least, it works that way for versions before OpenCL 1.2 (I only learned 1.1, sorry).
Comment 7 Marek Olšák 2016-01-15 16:01:05 UTC
I'll just tell you my opinion.

This is a bad idea.

It can't work.

It will never work.

It's doomed to fail.

Don't waste your time.

You need to study how most games draw things. You'll see rendering to textures a lot, not so much to the main framebuffer. A shader can read any texel from a texture that was rendered to. This means that any pixel from any render target must be available to all clients which can read it as a texture. This is why split-screen rendering fails. Alternate frame rendering has the same issue if there are inter-frame dependencies, e.g. a render target is only updated every other frame (reflections/mirrors), or there is motion blur, or rain hitting the camera/screen, which is an incremental process, or any other incremental screen-space effect. This is why all hybrid solutions fall short of expectations.

Get over it. Move on.
Comment 8 ytrezq 2016-01-15 20:37:00 UTC
(In reply to Marek Olšák from comment #7)
> I'll just tell you my opinion.
> 
> This is a bad idea.
> 
> It can't work.
> 
> It will never work.
> 
> It's doomed to fail.
> 
> Don't waste our time.
> 
Really? Would even exposing llvmpipe as a real GPU in Vulkan be that bad?
Comment 9 Jan Vesely 2016-01-15 22:51:13 UTC
If GPU<->CPU synchronization is the only problem, then the idea should work for iGPUs with coherent access to main memory (HSA compliant?). Then again, those GPUs probably won't need CPU help any time soon.
Comment 10 ytrezq 2016-01-16 00:54:02 UTC
(In reply to Jan Vesely from comment #9)
> If GPU<->CPU synchronization is the only problem then the idea should work
> for iGPUs with coherent access to main memory (HSA complaint?). Then again
> those GPUs probably won't need CPU help any time soon.

Intel or AMD APUs can’t (theoretically) do memory sharing between CPU and GPU.
Intel or AMD APUs use PCIe even if they are on the same die as the CPU. The BIOS or the UEFI firmware allocates a fixed portion of RAM that appears to the OS as part of “hardware reserved memory”.
There’s no unified memory; data has to go over the internal PCIe bus. It’s the same as if the card were an external one with its own memory (though it can directly address the CPU’s RAM modules).

The main problem with GPUs that only use CPU memory is that they suffer from bandwidth limits (so it’s as if they always operate under synchronization problems), which is why adding a RAM module can double framerates. APUs only combine drawbacks.



The generation of Nvidia cards based on Pascal, though, will be able to directly address CPU memory while keeping their own GDDR. This will effectively provide unified memory, but this time the RAM modules will stay behind a PCIe or NVLink bus.


I agree with comment #1 that this is work for a team of future PhD developers, not for a real software project.
It would probably take far more manpower than is available, and synchronization definitely does not look like the primary issue in the scenario of GPU-accelerated llvmpipe.


But to sum up the current 2.5 scenarios:
— Use one or several GPUs to accelerate some part of llvmpipe’s computation (issues were raised above).

— Combine the processing power of several GPUs and make llvmpipe appear as a real GPU among the others:
– Do it transparently (this is what the first versions of OpenCL do; Nvidia does it with high-level APIs like Direct3D 11 or OpenGL as long as the cards use a supported driver with the same WDDM version, probably relying on other SLI hardware features to do that). Seems to be theoretically impossible.
– Restrict it to Vulkan: very few devices will support this; you’ll need hardware that is less than 1 year old (this includes computers and tablets). There’s also the problem of all previous work written in OpenGL. It only seems like a good idea in some rare supercomputer scenarios.
Comment 11 Marek Olšák 2016-01-16 01:10:20 UTC
(In reply to ytrezq from comment #8)
> (In reply to Marek Olšák from comment #7)
> > I'll just tell you my opinion.
> > 
> > This is a bad idea.
> > 
> > It can't work.
> > 
> > It will never work.
> > 
> > It's doomed to fail.
> > 
> > Don't waste our time.
> > 
> Really ? does even exposing llvmpipe as a real ɢᴘᴜ in VulKan will be that
> bad ?

Yes. If you have a real GPU, any CPU rendering is a bad idea.
Comment 12 ytrezq 2016-01-16 01:25:54 UTC
(In reply to Marek Olšák from comment #11)
> Yes. If you have a real GPU, any CPU rendering is a bad idea.

I just thought 2 combined CPUs or 2 combined GPUs were better than 1, even when one of the parts has insignificant power compared to the others.

That’s why I wrote:
> Really ? does even exposing llvmpipe as a real ɢᴘᴜ in VulKan will be that
> bad ?
By comparison, Direct3D 12 developers don’t use software rendering when they use several GPUs, because the Microsoft software rendering engine doesn’t show up as a GPU
(though the Microsoft rendering engine doesn’t use SIMD at all, contrary to llvmpipe).
I don’t see any reason why rendering wouldn’t get faster in that Vulkan case (it’s up to the programmer to decide whether or not to use multi-GPU configurations). At least the aim is to not make the llvmpipe GPU different from the other GPUs at the API level.

The idea was not to run Vulkan on CPUs only, of course.


I already wrote that such a feature should be made optional at run-time, so there would be no problem for users fearing overheating.
Comment 13 Alex Deucher 2016-01-16 02:19:25 UTC
(In reply to ytrezq from comment #10)
> (In reply to Jan Vesely from comment #9)
> > If GPU<->CPU synchronization is the only problem then the idea should work
> > for iGPUs with coherent access to main memory (HSA complaint?). Then again
> > those GPUs probably won't need CPU help any time soon.
> 
> Intel or ᴀᴍᴅ ᴀᴘᴜs can’t (theoretically) do memory sharing between ᴄᴘᴜ ᴀɴᴅ
> ɢᴘᴜ.
> Intel or ᴀᴍᴅ ᴀᴘᴜs use ᴘᴄɪe even if they are on the same die of the ᴄᴘᴜ. The
> ʙɪᴏꜱ  or the ᴜᴇꜰɪ firmware allocate a fixed portion of ʀᴀᴍ that appear as
> part of “hardware reserved memory” to the ᴏꜱ.
> There’s no Unified memory, data has to use the internal ᴘᴄɪe bus. It’s the
> same as if the card was an external one with it’s own memory (though it can
> directly address ᴄᴘᴜ ram modules).
> 

I can't speak for intel, but on AMD APUs, while the GPU appears as a device on the PCIE bus, it actually has a much faster internal connection to the memory controller.
Comment 14 Marek Olšák 2016-01-16 10:58:45 UTC
(In reply to ytrezq from comment #12)
> (In reply to Marek Olšák from comment #11)
> > Yes. If you have a real GPU, any CPU rendering is a bad idea.
> 
> I just thought 2 combined ᴄᴘᴜs or 2 combined ɢᴘᴜs were better than 1. Even
> in the case the power of 1 of the part has an insignificant power compared
> to others.
> 
> That’s why I wrote :
> > Really ? does even exposing llvmpipe as a real ɢᴘᴜ in VulKan will be that
> > bad ?
> By comparison, Direct3D 12 developers don’t use software rendering when they
> use several ɢᴘᴜs, because the Microsoft Software rendering engine don’t show
> up as a ɢᴘᴜ.
> (though the Microsoft rendering engine doesn’t use ꜱɪᴍᴅ at all contrary to
> llvmpipe).
> I don’t see any reason for not getting rendering faster in that Vulkan case
> (It’s up to the programmer to decide to use multi‑ɢᴘᴜs configurations or
> not). At least the aim is to not make the llvmpipe ɢᴘᴜ different from others
> ɢᴘᴜs at the ᴀᴘɪ level.
> 
> It wasn’t in the idea in running Vulkan on only ᴄᴘᴜs of course.
> 
> 
> I already wrote such feature should be made optional at run‑time. So there’s
> would be no problems for users fearing overheat.

You are completely missing the point. The main concern is that applications may try to use all available renderers, including llvmpipe if it's present. The problem is that llvmpipe would significantly slow down drawing because of its slow rendering and high overhead. I know from experience that if applications are given a way to hurt their performance, they will eagerly take it. And everybody will blame Linux. Everybody always blames Linux for their problems.
Comment 15 ytrezq 2016-01-16 12:25:24 UTC
(In reply to Alex Deucher from comment #13)
> I can't speak for intel, but on AMD APUs, while the GPU appears as a device
> on the PCIE bus, it actually has a much faster internal connection to the
> memory controller.
You’re still confusing things. Of course they use the same memory controller directly. Of course they share the same memory modules.
However, they can’t read or write each other’s memory, so this behaves like an external PCIe card with its own memory modules (if we ignore that the bandwidth is also shared with another device, so each slows the other down).

So if you want to send or receive data, it can only happen over the PCIe bus, triggering the same synchronization problems as external chips, due to bus bandwidth and instruction overhead.

Unified memory will only happen in future generations of graphics cards, and only behind a PCIe bus (which will slow things down, because the memory controller bus still adds overhead), so we’re far from the time when the GPUs of APUs will be able to access CPU RAM with the same overhead (by then, RAM modules are expected to have been merged into the CPU chip, meaning there would no longer be separate RAM modules).
Comment 16 ytrezq 2016-01-16 12:41:00 UTC
(In reply to Alex Deucher from comment #13)
> I can't speak for intel, but on AMD APUs, while the GPU appears as a device
> on the PCIE bus, it actually has a much faster internal connection to the
> memory controller.

To simplify, let’s use the comparison of 2 virtual machines:

Each one is configured so that neither of them can fill the RAM, even when running alone (e.g. max 4 GB of RAM used, with 64 GB available on the host).

Can the first virtual machine read the RAM of the second? No!
Can the second virtual machine read the RAM of the first? Neither!
In both cases they share the same memory hardware.

The same applies to the GPUs and CPUs of APUs, but with the hypervisor OS replaced by a purely hardware one, so there is no software part (with the exception of the allocated memory ratio being controllable in the BIOS or the UEFI firmware).
Comment 17 ytrezq 2016-01-16 13:25:18 UTC
(In reply to Marek Olšák from comment #14)
> You are completely missing the point. The main concern is that applications
> may try to use all available renderers, including llvmpipe if it's present.
> The problem is that llvmpipe would significantly slow down drawing because
> of its slow rendering and high overhead. I know from experience that if
> applications are given a way to hurt their performance, they will eagerly
> take it. And everybody will blame Linux. Everybody always blames Linux for
> their problems.

Is it only because it’s the CPU (if so, why?), or is it because mixing slower GPU hardware slows down faster GPUs in general (imagining the case of an Intel GMA with a GeForce 7xx, if they supported Vulkan, which won’t happen)?
As I have never heard of such a thing for OpenCL, I guess it’s the first case (again, why?).

Otherwise, yes, the aim is that applications try to use all available renderers, llvmpipe included (I thought that adding a renderer, even if it’s llvmpipe, could make things faster).

In the second case it might only be useful in the rare case of an old GPU combined with a fast modern GPU.
In the meantime, I’m not aware of any desktop that supports SSE4.2 and can output video without a GPU, so OpenGL support in llvmpipe is already for rare cases (e.g. a supercomputer without graphics cards being occasionally used for graphics rendering).

I also can’t imagine users blaming Linux after intentionally setting an environment variable and then complaining that things run slower than when it’s not set.
Comment 18 Marek Olšák 2016-01-16 14:15:42 UTC
A fast GPU + slow GPU is also a bad idea. There is the risk of improper load balancing resulting in the performance being the same or slightly above the slow GPU. Or if it's done badly, it can be well below the slow GPU.

(In reply to ytrezq from comment #17)
> So OpenGl support for llvmpipe is
> already for rare cases (e.g a supercomputer without a set of graphic cards
> being occasionally used for graphics rendering)

You are also completely missing the main use case for llvmpipe: to have desktop compositing if an OpenGL driver isn't installed or doesn't exist. Nobody cares about supercomputers.
Comment 19 ytrezq 2016-01-16 14:34:39 UTC
(In reply to Marek Olšák from comment #18)
> A fast GPU + slow GPU is also a bad idea. There is the risk of improper load
> balancing resulting in the performance being the same or slightly above the
> slow GPU. Or if it's done badly, it can be well below the slow GPU.
So why isn’t mixing a CPU and a fast GPU a problem in OpenCL?
> You are also completely missing the main use case for llvmpipe: to have
> desktop compositing if an OpenGL driver isn't installed or doesn't exist.
> Nobody cares about supercomputers.

llvmpipe is essentially an open-source driver itself (the only difference is that it doesn’t require an in-kernel part). So generally, when a modern Linux distro ships llvmpipe, it also makes sure that open-source drivers (with their kernel modules) for Nvidia, Intel and AMD are installed and loaded.

Hence the LIBGL_ALWAYS_SOFTWARE=1 that is so widely used on phoronix.com for benchmarking llvmpipe.
Comment 20 ytrezq 2016-01-16 14:41:03 UTC
Currently, the main case of llvmpipe that I’m seeing is llvmpipe being used as the graphics rendering engine because of a misconfigured automatic setup, even though everything is installed to use the hardware GPU.

(Simply search Google for how to force-enable llvmpipe and you’ll only see posts asking how to disable it in order to use the hardware GPU.)
Comment 21 Jose Fonseca 2016-01-16 15:07:47 UTC
I'm one of llvmpipe's authors (I did, in particular, the initial bring-up of the driver).


Enabling llvmpipe to run on top of, or side-by-side with, GPUs or clusters via OpenCL/whatever is not something we're interested in.


OpenCL is inadequate for 3D graphics, and it also abstracts away too much of the CPU details to be useful for the highly optimized x86 code in llvmpipe.  If somebody wants to use GPUs for 3D graphics, they should use the 3D graphics GPU drivers.

Mixing GPUs with something else is also pointless, as others have pointed out here.  _Even_ if it made sense from a performance POV (which it does not), it's impossible merely from a correctness POV -- you'd have rasterization differences, depth fighting, all sorts of nasty issues, which all together are insurmountable.  (I.e., it would take many lifetimes of work to nail, for little or no benefit.)

If somebody wants to fork llvmpipe and pursue it they're free to do so, but there's no way we would merge any of that work back into llvmpipe (or likely not even Mesa).



The only exception IMO is Xeon Phi -- it looks like a GPU in some regards, but it has an x86-like ISA and runs its own OS -- so it wouldn't be too much of a stretch to port llvmpipe to run inside the Xeon Phi micro OS.  In order for this to be useful we'd need a thin transport gallium driver that would run on the host OS and communicate with the llvmpipe driver on the Xeon Phi.  That is, mimic the Larrabee architecture (but this time without any of the GPU fixed function that Larrabee had, like texture caches).


We have no plan to work on this ourselves -- performance would never beat a dedicated GPU with 3D graphics specific circuits --- but it's a cool project and not disruptive, so if somebody wanted to pursue this, I think this is something we could accommodate.


Of course, this is not what the bug reporter asked for: llvmpipe would only run inside Xeon Phi, it would not cooperate with another llvmpipe instance on the host.



BTW, what you asked has been attempted -- http://chromium.sourceforge.net/
Comment 22 ytrezq 2016-01-16 15:37:47 UTC
(In reply to Jose Fonseca from comment #21)
> OpenCL is inadequate for 3D graphics, and it also abstracts away too much of
> the CPU details to be useful for the highly optimized x86 code in llvmpipe. 
> If somebody wants to use GPUs for 3D graphics, they should use the 3D
> graphics GPU drivers.
Yeah, OpenCL is only useful for general-purpose computations.
> Mixing GPUs with something else is also pointless as others have pointed
> here.  _Even_ if it made sense from a performance POV (which does not), it's
> impossible merely from a correctness POV -- you'd have rasterzation
> differences, depth fighting, all sort of nasty issues, which all together
> are insurmountable.
This is what I thought: each technical issue is solvable, but put together they turn the lack of manpower into a blocker. (Though in the case of Vulkan, I still think many users will try to get better performance by combining the processing power of their integrated Intel HD with a GeForce 1000 and a top modern AMD card; this use case of a slow GPU with a fast GPU is even advertised for Direct3D 12.)
> 
> The only exception IMO, is Xeon Phi -- it looks like a GPU in some regards,
> but it has a x86-like ISA and runs its own OS --, so it wouldn't be too much
> of a stretch to port llvmpipe to run inside the Xeon Phi micro OS.  In order
> for this to be useful we'd need to have a thin transport gallium driver that
> would runs on the host OS and communicates with the llvmpipe driver in the
> Xeon Phi.  That is, mimic the Larrabee architecture (but this time without
> any of the GPU fixed function that Larrabee had like texture caches.)
> 
> 
> We have no plan to work on this ourselves -- performance would never beat a
> dedicated GPU with 3D graphics specific circuits --- but it's a cool project
> and not disruptive, so if somebody wanted to pursue this, I think this is
> something we could accommodate.
> 
I have doubts about the SIMD side in that case (since llvmpipe relies heavily on SIMD): isn’t the Xeon Phi internally a set of legacy Pentium Pro cores burned onto the same chip?
> 
> BTW, what you asked has been attempted -- http://chromium.sourceforge.net/
A Google search on chromium revealed nothing: could you give more detailed links, please?
Comment 23 Jan Vesely 2016-01-16 15:56:46 UTC
(In reply to ytrezq from comment #16)
> (In reply to Alex Deucher from comment #13)
> > I can't speak for intel, but on AMD APUs, while the GPU appears as a device
> > on the PCIE bus, it actually has a much faster internal connection to the
> > memory controller.
> 
> For simplifying let’s pick up the comparison of 2 virtual machines :
> 
> each one is well parametrized so they none of them can fill the ram, even if
> they are alone (ex :max use 4Gb of ʀᴀᴍ and 64Gb available on the host).
> 
> Does the first virtual machine can read the ʀᴀᴍ of the second ? No !
> Does the second virtual machine can read the ʀᴀᴍ of the first ? Neither !
> In both case they share the same memory hardware.
> 
> The same apply to ɢᴘᴜs with ᴄᴘᴜs on ᴀᴘᴜs but with the hypervisor ᴏꜱ  being
> replaced by a pure hardware one, so there’s no software parts (with the
> exception of the memory allocated memory ratio being controllable in the
> ʙɪᴏꜱ  or the ᴜᴇꜰɪ firmware)

the comparison with VMs is wrong, and the information about APUs is also wrong.

AMD APUs have complete access to the entire physical memory of the system. they use both coherent and non-coherent links which are faster (higher bw + lower latency) than PCIe lanes. see [0] if you want to learn more about APU memory (it's a bit dated so the numbers are different for latest products).

coherent memory access allows you to avoid synchronization overhead (or pay it on every access). the reason to mention APUs is that the difference between "GPU memory" and coherent system ram is much smaller than dGPUs. the reason to mention HSA is because they implement this approach wrt. compute; agents with different capabilities sharing coherent view of the memory.


[0] http://developer.amd.com/wordpress/media/2013/06/1004_final.pdf
Comment 24 Alex Deucher 2016-01-16 16:04:53 UTC
(In reply to ytrezq from comment #15)
> (In reply to Alex Deucher from comment #13)
> > I can't speak for intel, but on AMD APUs, while the GPU appears as a device
> > on the PCIE bus, it actually has a much faster internal connection to the
> > memory controller.
> You’re still confusing things. Of course they use the same memory controller
> directly. Of course they share the same memory modules.
> However they can’t read or write in memory of each others. So this behave
> like an external card ᴘᴄɪe card with it’s own memory modules (if we forget
> the bandwidth is also shared with an another device so each ones slow each
> others).
> 
> So if you want to send or receive data it can only happens over the ᴘᴄɪe
> bus, triggering the same synchronisations problems of external chipsets due
> to the bus bandwidth and instructions overhead.
> 
> Unified memory will only happen in future generations of graphics cards, but
> only behind a ᴘᴄɪe bus (which will slow things because there’s still the
> memory controller bus adding overhead), so we’re far from the time were ɢᴘᴜs
> of ᴀᴘᴜs will be able to access ʀᴀᴍ of ᴄᴘᴜs with the same overhead (at that
> time it’s expected ʀᴀᴍ modules would have been merged into the ᴄᴘᴜ chip
> meaning there would be no longer separated ʀᴀᴍ modules).

No, it happens right now.  The stolen memory used for APU "vram" is mainly for vbios splash screen post messages, to minimize the amount of gpu setup required in the vbios and to provide contiguous memory, which is slightly faster than going through an MMU.  Once the APU driver has initialized, it can map system memory directly via the GPU's MMU.  Access to that memory does not go over the pcie bus.  The GPU has a direct internal link to system memory similar to what the CPU has.
Comment 25 ytrezq 2016-01-16 16:42:37 UTC
(In reply to Alex Deucher from comment #24)
> The GPU has a direct internal link the system memory similar to what the CPU
> has
I never said the contrary.
Comment 26 ytrezq 2016-01-16 16:45:15 UTC
Concerning synchronization, it doesn’t remove the shared bandwidth with the CPU: http://superuser.com/q/789816/282033
Comment 27 ytrezq 2016-01-16 17:04:56 UTC
At least for Intel APUs there’s a PCIe bus:
http://i.stack.imgur.com/Gcwi4.png
Comment 28 Jason Ekstrand 2016-01-16 17:09:13 UTC
(In reply to ytrezq from comment #15)
> (In reply to Alex Deucher from comment #13)
> > I can't speak for intel, but on AMD APUs, while the GPU appears as a device
> > on the PCIE bus, it actually has a much faster internal connection to the
> > memory controller.
> You’re still confusing things. Of course they use the same memory controller
> directly. Of course they share the same memory modules.
> However they can’t read or write in memory of each others. So this behave
> like an external card ᴘᴄɪe card with it’s own memory modules (if we forget
> the bandwidth is also shared with an another device so each ones slow each
> others).
> 
> So if you want to send or receive data it can only happens over the ᴘᴄɪe
> bus, triggering the same synchronisations problems of external chipsets due
> to the bus bandwidth and instructions overhead.
> 
> Unified memory will only happen in future generations of graphics cards, but
> only behind a ᴘᴄɪe bus (which will slow things because there’s still the
> memory controller bus adding overhead), so we’re far from the time were ɢᴘᴜs
> of ᴀᴘᴜs will be able to access ʀᴀᴍ of ᴄᴘᴜs with the same overhead (at that
> time it’s expected ʀᴀᴍ modules would have been merged into the ᴄᴘᴜ chip
> meaning there would be no longer separated ʀᴀᴍ modules).

This is just plain false.  Intel GPUs have, for a very long time, shared memory with the CPU.  They both have access to the exact same physical pages.  The memory segregation you are referring to was a quirk of the Windows drivers of the time.  The Linux driver, as far as I know, has always allowed the GPU and CPU to freely access the same memory.  On big-core systems (i3, i5, i7), they even share the same L3 cache, so access is coherent between the two.

(In reply to ytrezq from comment #26)
> Concerning synchronisation, it don’t removes shared bandwidth with the ᴄᴘᴜ
> http://superuser.com/q/789816/282033

Please don't post links to random user focussed Q&A sites or forums to try and explain to us how our hardware works.  The people you are talking to here are driver developers!  We (AMD and Intel at least) have access to the real documentation and know the hardware details very well.
Comment 29 Jason Ekstrand 2016-01-16 17:12:10 UTC
(In reply to ytrezq from comment #27)
> At least for intel ᴀᴘᴜs there’s ᴘᴄɪe bus :
> http://i.stack.imgur.com/Gcwi4.png

That picture is a lie.  It shows up on the bus, yes, but only because that makes it easier to deal with from software (PCI is a well-defined protocol that operating systems know what to do with).  Its access to memory is side-band and doesn't actually go through the bus.
Comment 30 Marek Olšák 2016-01-16 19:26:48 UTC
(In reply to ytrezq from comment #19)
> llvmpipe is essentially an opersource driver itself (the difference is only
> not requiring in‑kernel part). So generally when a modern Linux distro ship
> llvmpipe, it also makes sure Opensource drivers (with their kernel modules)
> for Nvidia ; Intel ; ᴀᴍᴅ are also installed and loaded.

You can't be more wrong. I wish you said something that is correct, but sadly that's probably not gonna happen.

llvmpipe doesn't need any hardware-specific drivers, not even kernel drivers. The kernel and/or X can light up the display and change resolutions without knowing or caring what hardware it's running on.
Comment 31 ytrezq 2016-01-16 20:07:27 UTC
(In reply to Marek Olšák from comment #30)
> 
> You can't be more wrong. I wish you said something that is correct, but
> sadly that's probably not gonna happen.
> 
> llvmpipe doesn't need any hardware-specific drivers, not even kernel
> drivers. The kernel and/or X can light up the display and change resolutions
> without knowing or caring what hardware it's running on.

I already said that about the kernel. I recognize I used the wrong word: llvmpipe is definitely partially hardware dependent (SIMD instructions).
Comment 32 Marek Olšák 2016-01-16 20:13:39 UTC
(In reply to ytrezq from comment #31)
> I already told that for kernel. I recognize I used the wrong word : llvmpipe
> is definitely partially dependent (ꜱɪᴍᴅ instructions)

Yes, dependent on the CPU. :)
Comment 33 ytrezq 2016-01-16 21:44:34 UTC
(In reply to Marek Olšák from comment #32) 
> Yes, dependent on the CPU. :)
Hardware dependent, like the user-mode libraries that drive the GPU’s kernel DRM.
Comment 34 Daniel Stone 2016-01-18 12:21:51 UTC
@ytrezq: Please stop CCing so many people. Everyone involved is already aware of the discussion (and has already come to a set view on this which is unlikely to change); adding further CCs will be considered as abuse and may lead to access being removed. Thanks.
Comment 35 Christian König 2016-01-18 13:22:50 UTC
Yeah, agreed, I was just about to complain about the traffic.

Anyway the opinions on this topic were already explained numerous times. While I only skimmed over it I completely agree with Marek and the other core developers that this idea doesn't sound valuable.

Let's close this.
