Created attachment 139124 [details]
volund-benchmark.sh

Hello,

a general issue encountered in DXVK is that pipelines have to be compiled at draw time, since the pipeline state and the combination of shaders used for rendering are not known in advance. This leads to noticeable stutter in a lot of games when the shader cache is cold.

The attached script measures the pipeline compile times of the Unity Blacksmith demo, which can be downloaded here: https://blogs.unity3d.com/2015/06/24/releasing-the-blacksmith/

I picked this demo because it warms the shader cache as part of its loading process, making for a reproducible test case. On my Ryzen 2700X setup, all vkCreateGraphicsPipelines calls combined take about four minutes using the "Higher" preset on mesa 18.0.1, LLVM 6.0 and latest dxvk-master. About two thirds of the total CPU time is spent inside LLVM, and about one sixth inside libvulkan-radeon.

An optimization I have in mind for DXVK would be to use VK_PIPELINE_CREATE_DISABLE_OPTIMIZATION_BIT to reduce the initial compile times, and to compile an optimized version of the pipeline asynchronously in the background. However, this requires driver support for VK_PIPELINE_CREATE_DISABLE_OPTIMIZATION_BIT.

Please let me know whether this is a viable option and whether significant gains can be expected from implementing support for this flag in the driver.
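For reference, a minimal sketch of what the DXVK side could look like, assuming driver support for the flag; create_pipeline_fast and the commented-out worker submission are hypothetical helpers, only the flag and vkCreateGraphicsPipelines come from the Vulkan API:

#include <vulkan/vulkan.h>

/* Sketch: compile a cheap, unoptimized pipeline for immediate use and leave
 * the optimized recompile to a background worker. createInfo is assumed to
 * be fully populated elsewhere; error handling is omitted. */
VkPipeline create_pipeline_fast(VkDevice device, VkPipelineCache cache,
                                VkGraphicsPipelineCreateInfo createInfo)
{
    VkPipeline pipeline = VK_NULL_HANDLE;

    /* First compile: skip optimizations to minimize the draw-time stall. */
    createInfo.flags |= VK_PIPELINE_CREATE_DISABLE_OPTIMIZATION_BIT;
    vkCreateGraphicsPipelines(device, cache, 1, &createInfo, NULL, &pipeline);

    /* Second compile: same state without the flag, off the critical path.
     * submit_to_worker() is a hypothetical async queue; its result would
     * later replace the unoptimized pipeline for future draws. */
    createInfo.flags &= ~VK_PIPELINE_CREATE_DISABLE_OPTIMIZATION_BIT;
    /* submit_to_worker(device, cache, &createInfo); */

    return pipeline;
}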
Hi Phillip,

It's doable, but it would require a non-trivial amount of work, and I'm not sure the gain will be significant, especially if most of the time is spent in LLVM. Also, disabling optimizations will reduce CPU usage, of course, but it will increase GPU usage (until the optimized pipeline is ready), so again I'm not sure. I'm open to discussion, though. :)
I'll take a look into this. We could probably turn off or limit a number of NIR passes without too much trouble (such as the link-time opts), and it's something we can likely improve incrementally. I'm not too sure how much we can dial down LLVM; this will take some investigation.

One concern (besides what Samuel has already mentioned) is that turning off some optimisation passes may trigger bugs that would normally be hidden.
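To illustrate the kind of gating I have in mind, here is a rough sketch; the 'optimize' flag and the particular pass subset are made up for illustration and not what radv currently does:

#include "nir/nir.h"  /* Mesa tree include path */

/* Sketch only: run a reduced NIR cleanup when an unoptimized compile is
 * requested (e.g. VK_PIPELINE_CREATE_DISABLE_OPTIMIZATION_BIT), and the
 * full loop otherwise. The pass selection here is illustrative. */
static void
optimize_nir_sketch(nir_shader *shader, bool optimize)
{
    bool progress;
    do {
        progress = false;
        NIR_PASS(progress, shader, nir_opt_dce);            /* keep: cheap cleanup */
        NIR_PASS(progress, shader, nir_opt_cse);
        if (!optimize)
            break;                                          /* single cleanup pass only */
        NIR_PASS(progress, shader, nir_opt_algebraic);      /* skipped when unoptimized */
        NIR_PASS(progress, shader, nir_opt_constant_folding);
    } while (progress);
}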
As long as scratch buffer support is robust, removing LLVM IR optimization passes is probably not a problem, though you really do want mem2reg, and I don't think we spend much time in the others (at least radeonsi didn't, last time I checked).

Using the -O0 settings for the codegen backend is a lot riskier. Our compute folks have done some work fixing bugs there, but I really wouldn't recommend it.
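As a sketch of that pass selection through the legacy LLVM C API (as of LLVM 6); the helper name and the exact pass list are illustrative, the main point is that mem2reg stays in:

#include <llvm-c/Core.h>
#include <llvm-c/Transforms/Scalar.h>

/* Sketch: a stripped-down IR pass pipeline for an "unoptimized" compile.
 * mem2reg is kept so values end up in SSA registers instead of stack slots;
 * everything beyond that is optional cleanup. */
static void run_minimal_ir_passes(LLVMModuleRef module)
{
    LLVMPassManagerRef pm = LLVMCreatePassManager();

    LLVMAddPromoteMemoryToRegisterPass(pm); /* mem2reg: the one pass to keep */
    LLVMAddAggressiveDCEPass(pm);           /* cheap dead-code cleanup */

    LLVMRunPassManager(pm, module);
    LLVMDisposePassManager(pm);
}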
(In reply to Nicolai Hähnle from comment #3)
> As long as scratch buffer support is robust, removing LLVM IR optimization
> passes is probably not a problem, though you really do want mem2reg, and I
> don't think we spend much time in the others (at least radeonsi didn't,
> last time I checked).
>
> Using the -O0 settings for the codegen backend is a lot riskier. Our compute
> folks have done some work fixing bugs there, but I really wouldn't recommend
> it.

Yeah, I've done some experimenting with the Blacksmith demo. I'm not sure we can get much benefit from implementing VK_PIPELINE_CREATE_DISABLE_OPTIMIZATION_BIT with the current state of things.

Default:
Sum of shader compile times: 325933 ms

With only the LLVM DCE opt (compilation fails without this):
Sum of shader compile times: 326451 ms

No NIR linking plus a single pass over the NIR opts (compilation fails without this):
Sum of shader compile times: 294788 ms
FWIW, with llvmpipe (gallivm) we found that LICM can have a very high cost (in particular the lcssa pass that comes with it). I think, though, it was mostly related to the main shader loop, which you don't have with radeonsi.

Doing some experiments, having early-cse near the beginning (after sroa) seemed to help somewhat, as it tends to make the IR simpler for the later passes at a small cost (albeit sroa itself can blow the IR up quite a bit). sroa and early-cse at the beginning is also close to what offline llvm opt -O2 would do. Albeit radeonsi already has the memssa version of early-cse before instcombine, so maybe that's sufficient...

The -time-passes and -debug-pass=Structure options tell you a lot about which passes actually get run and how much time they need; these also work for codegen (llc). Of course, that requires you to dump the bitcode somewhere out of the driver (but if it's just millions of small shaders, I wouldn't really expect much in any case).

If there are some guidelines on which passes make sense to run in which order, I'd definitely be quite interested in that...
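In case it helps, a sketch of dumping the module from the driver so it can be fed to the offline tools; the path and helper name are arbitrary:

#include <llvm-c/BitWriter.h>

/* Sketch: write the module to disk for offline analysis, e.g.:
 *   opt -time-passes -O2 /tmp/shader.bc -o /dev/null
 *   llc -debug-pass=Structure /tmp/shader.bc -o /dev/null
 * A real dump would want a unique file name per shader. */
static void dump_shader_bitcode(LLVMModuleRef module)
{
    LLVMWriteBitcodeToFile(module, "/tmp/shader.bc");
}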
(In reply to Timothy Arceri from comment #4)
> Yeah, I've done some experimenting with the Blacksmith demo. I'm not sure we
> can get much benefit from implementing
> VK_PIPELINE_CREATE_DISABLE_OPTIMIZATION_BIT with the current state of
> things.
>
> Default:
> Sum of shader compile times: 325933 ms
>
> With only the LLVM DCE opt (compilation fails without this):
> Sum of shader compile times: 326451 ms
>
> No NIR linking plus a single pass over the NIR opts (compilation fails
> without this):
> Sum of shader compile times: 294788 ms

I've done some playing around with the LLVM codegen opt levels:

LLVMCodeGenLevelNone + LLVMAddEarlyCSEMemSSAPass (compilation fails without this):
Sum of shader compile times: 211403 ms
However, there are all sorts of rendering issues when running the demo.

No NIR linking plus a single pass over the NIR opts (compilation fails without this), plus LLVMCodeGenLevelNone + LLVMAddEarlyCSEMemSSAPass (compilation fails without this):
Sum of shader compile times: 179775 ms
With this, the demo doesn't actually display the graphics; it just shows a flickering Unity logo throughout the run.
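For context, that combination looks roughly like this through the LLVM C API; the triple, CPU string and helper names are placeholders rather than the actual radv code:

#include <llvm-c/Core.h>
#include <llvm-c/TargetMachine.h>
#include <llvm-c/Transforms/Scalar.h>

/* Sketch of the tested combination: codegen at -O0 plus a single
 * EarlyCSE (MemorySSA) IR pass. Triple/CPU/features are placeholders. */
static LLVMTargetMachineRef create_fast_target_machine(LLVMTargetRef target)
{
    return LLVMCreateTargetMachine(target, "amdgcn--", "gfx900", "",
                                   LLVMCodeGenLevelNone,   /* instead of Default */
                                   LLVMRelocDefault,
                                   LLVMCodeModelDefault);
}

static void add_minimal_ir_opts(LLVMPassManagerRef pm)
{
    LLVMAddEarlyCSEMemSSAPass(pm); /* compilation failed without this */
}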
(In reply to Timothy Arceri from comment #6)
> I've done some playing around with the LLVM codegen opt levels:
>
> LLVMCodeGenLevelNone + LLVMAddEarlyCSEMemSSAPass (compilation fails without
> this):
> Sum of shader compile times: 211403 ms
> However, there are all sorts of rendering issues when running the demo.
>
> No NIR linking plus a single pass over the NIR opts (compilation fails
> without this), plus LLVMCodeGenLevelNone + LLVMAddEarlyCSEMemSSAPass
> (compilation fails without this):
> Sum of shader compile times: 179775 ms
> With this, the demo doesn't actually display the graphics; it just shows a
> flickering Unity logo throughout the run.

OK, so it seems this speed-up (and the display issues that go with it) is due to switching from the GreedyRegisterAllocator to the FastRegisterAllocator.
The fast register allocator stresses the spill logic a lot. I believe it basically spills at the end of every basic block and reloads at the start of every basic block. Plus it's not very well tested with AMDGPU, so this really isn't surprising.
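If the fast allocator is the main culprit, one untested idea would be to pin the register allocator back to greedy while still dropping the codegen opt level; -regalloc= is a real backend option, but whether this combination behaves well on AMDGPU is an open question:

#include <llvm-c/Support.h>

/* Untested sketch: keep the greedy register allocator even when the target
 * machine is created with LLVMCodeGenLevelNone. Note that -regalloc= is a
 * process-global backend option, so this affects every compile. */
static void force_greedy_regalloc(void)
{
    static const char *const argv[] = { "mesa", "-regalloc=greedy" };
    LLVMParseCommandLineOptions(2, argv, "");
}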
Here is an initial patch that turns down the level of NIR optimisations:

https://patchwork.freedesktop.org/patch/221407/

The speed-ups are not huge, but it's a start.
I added an initial implementation to a separate branch in DXVK:

https://github.com/doitsujin/dxvk/tree/disable-opt-bit

It currently does not use derivative pipelines (I'll have to re-implement that at some point), and the benchmark script attached to this bug report will count both the optimized and unoptimized pipelines, but so far it seems to work without major issues.
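For reference, the approach boils down to something like the following; this is a simplified C sketch, not the actual DXVK code (which is C++ and manages its own worker threads), and the job struct, slot pointer and helper names are made up:

#include <pthread.h>
#include <vulkan/vulkan.h>

/* Simplified sketch: compile the optimized variant on a worker thread and
 * swap it in once it is ready. The create info and everything it points to
 * must stay alive until the worker finishes. */
struct optimize_job {
    VkDevice device;
    VkPipelineCache cache;
    VkGraphicsPipelineCreateInfo info; /* copy without the disable-opt flag */
    VkPipeline *slot;                  /* where the draw path looks up the pipeline */
};

static void *optimize_worker(void *arg)
{
    struct optimize_job *job = arg;
    VkPipeline optimized = VK_NULL_HANDLE;

    vkCreateGraphicsPipelines(job->device, job->cache, 1, &job->info,
                              NULL, &optimized);

    /* A real implementation must synchronize this swap with the render
     * thread and eventually destroy the unoptimized pipeline. */
    *job->slot = optimized;
    return NULL;
}

static void queue_optimized_compile(struct optimize_job *job)
{
    pthread_t thread;
    pthread_create(&thread, NULL, optimize_worker, job);
    pthread_detach(thread);
}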
RADV has supported VK_PIPELINE_CREATE_DISABLE_OPTIMIZATION_BIT for a while now. Closing.