| Summary: | radv: VK_PIPELINE_CREATE_DISABLE_OPTIMIZATION_BIT and bringing down initial pipeline compile times | | |
|---|---|---|---|
| Product: | Mesa | Reporter: | Philip Rebohle <philip.rebohle> |
| Component: | Drivers/Vulkan/radeon | Assignee: | Timothy Arceri <t_arceri> |
| Status: | RESOLVED FIXED | QA Contact: | mesa-dev |
| Severity: | normal | | |
| Priority: | medium | | |
| Version: | git | | |
| Hardware: | Other | | |
| OS: | All | | |
| Attachments: | volund-benchmark.sh | | |
Description (Philip Rebohle, 2018-04-26 08:54:25 UTC):
Comment 1:

Hi Phillip,

It's doable, but it would require a non-trivial amount of work, and I'm not sure the gain would be significant, especially if most of the time is spent in LLVM. Also, disabling optimizations will of course reduce CPU usage, but it will increase GPU usage (until the optimized pipeline is ready), so again I'm not sure. I'm open to discussion, though. :)

Comment 2:

I'll take a look into this. We could probably turn off or limit a number of NIR passes without too much trouble (such as the link-time opts), and it is also something we can likely improve incrementally. I'm not too sure how much we can dial down LLVM; that will take some investigation.

One concern (besides what Samuel has already mentioned) is that turning off some optimisation passes may trigger bugs that would normally be hidden.

Comment 3 (Nicolai Hähnle):

As long as scratch buffer support is robust, removing LLVM IR optimization passes is probably not a problem, though you really do want mem2reg, and I don't think we spend much time in the others (at least radeonsi didn't, last time I checked).

Using the -O0 settings for the codegen backend is a lot riskier. Our compute folks have done some work fixing bugs there, but I really wouldn't recommend it.
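[Illustration: a minimal sketch of the reduced IR pipeline comment #3 suggests — keep mem2reg, plus the DCE pass that comment #4 found to be required — using the legacy pass-manager C API from the LLVM 6 era of this report. The helper name `run_minimal_ir_opts` is hypothetical; radv's real pass setup differs.]

```c
#include <llvm-c/Core.h>
#include <llvm-c/Transforms/Scalar.h> /* in LLVM 7+, mem2reg moved to llvm-c/Transforms/Utils.h */

/* Hypothetical reduced IR pipeline for a DISABLE_OPTIMIZATION_BIT compile:
 * keep mem2reg so locals are promoted to SSA values instead of living in
 * scratch memory, and keep DCE, without which compilation was found to fail. */
static void run_minimal_ir_opts(LLVMModuleRef module)
{
    LLVMPassManagerRef pm = LLVMCreatePassManager();
    LLVMAddPromoteMemoryToRegisterPass(pm); /* mem2reg */
    LLVMAddAggressiveDCEPass(pm);           /* dead-code elimination */
    LLVMRunPassManager(pm, module);
    LLVMDisposePassManager(pm);
}
```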
Comment 4 (Timothy Arceri):

(In reply to Nicolai Hähnle from comment #3)
> As long as scratch buffer support is robust, removing LLVM IR optimization
> passes is probably not a problem, though you really do want mem2reg [...]

Yeah, I've done some experimenting with the Blacksmith demo. I'm not sure we can get much benefit from implementing VK_PIPELINE_CREATE_DISABLE_OPTIMIZATION_BIT with the current state of things.

Default:
Sum of shader compile times: 325933 ms

With only the LLVM DCE opt (compilation fails without this):
Sum of shader compile times: 326451 ms

No NIR linking plus a single pass over the NIR opts (compilation fails without this):
Sum of shader compile times: 294788 ms

Comment 5:

FWIW, with llvmpipe (gallivm) we found that LICM can have a very high cost (in particular the lcssa pass that comes with it). I think, though, it was mostly related to the main shader loop, which you don't have with radeonsi. In some experiments, having early-cse near the beginning (after sroa) seemed to help somewhat, as it tends to make the IR simpler for the later passes at a small cost (albeit sroa itself can blow the IR up quite a bit). sroa and early-cse at the beginning is also close to what an off-line llvm opt -O2 would do. Albeit radeonsi already has the MemorySSA version of early-cse before instcombine, so maybe that's sufficient...

-time-passes and -debug-pass=Structure tell you a lot about which passes actually get run and how much time they need; these also work for codegen (llc). Of course, that requires dumping the bitcode out of the driver somewhere (but if it's just millions of small shaders I wouldn't really expect much in any case).

If there are guidelines on which passes make sense to run in which order, I'd definitely be quite interested in that...

Comment 6 (Timothy Arceri):

(In reply to Timothy Arceri from comment #4)
> No NIR linking plus a single pass over the NIR opts (compilation fails
> without this):
> Sum of shader compile times: 294788 ms

I've done some playing around with the LLVM codegen opt levels:

LLVMCodeGenLevelNone + LLVMAddEarlyCSEMemSSAPass (compilation fails without this):
Sum of shader compile times: 211403 ms
However, there are all sorts of rendering issues when running the demo.

No NIR linking plus a single pass over the NIR opts (compilation fails without this), LLVMCodeGenLevelNone + LLVMAddEarlyCSEMemSSAPass (compilation fails without this):
Sum of shader compile times: 179775 ms
With this, the demo doesn't actually display the graphics; it just shows a flickering Unity logo throughout the run.

Comment 7:

(In reply to Timothy Arceri from comment #6)
> LLVMCodeGenLevelNone + LLVMAddEarlyCSEMemSSAPass (compilation fails without
> this):
> Sum of shader compile times: 211403 ms

OK, so it seems this speed-up (and the display issues that go with it) are due to switching from the GreedyRegisterAllocator to the FastRegisterAllocator.

Comment 8:

The fast register allocator stresses the spill logic a lot. I believe it basically spills at the end of every basic block and reloads at the start of every basic block. Plus, it's not very well tested with AMDGPU, so this really isn't surprising.

Comment 9:

Here is an initial patch that turns down the level of NIR optimisations:
https://patchwork.freedesktop.org/patch/221407/

The speed-ups are not huge, but it's a start.
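[Illustration: the LLVMCodeGenLevelNone + LLVMAddEarlyCSEMemSSAPass combination from comment #6, expressed through the LLVM-C API of the same era. The triple and CPU strings below are placeholders, not radv's actual configuration.]

```c
#include <llvm-c/Core.h>
#include <llvm-c/TargetMachine.h>
#include <llvm-c/Transforms/Scalar.h>

/* -O0 backend: this is what implicitly selects the fast register allocator
 * blamed for the rendering issues in comments #7 and #8. */
static LLVMTargetMachineRef create_unoptimized_tm(LLVMTargetRef target)
{
    return LLVMCreateTargetMachine(target,
                                   "amdgcn--",  /* placeholder triple */
                                   "gfx900",    /* placeholder CPU */
                                   "",          /* no extra features */
                                   LLVMCodeGenLevelNone,
                                   LLVMRelocDefault,
                                   LLVMCodeModelDefault);
}

/* The one IR pass the experiment could not drop without compile failures. */
static void add_required_ir_passes(LLVMPassManagerRef pm)
{
    LLVMAddEarlyCSEMemSSAPass(pm);
}
```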
Comment 10 (Philip Rebohle):

I added an initial implementation to a separate branch in DXVK:
https://github.com/doitsujin/dxvk/tree/disable-opt-bit

It currently does not use derivative pipelines (I'll have to re-implement that at some point), and the benchmark script attached to this bug report will count both the optimized and unoptimized pipelines, but so far it seems to work without major issues.
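[Illustration: the client-side pattern comment #10 describes boils down to setting one flag at pipeline creation and compiling the optimized replacement later, off the critical path. `create_pipeline_fast` is a hypothetical helper, not DXVK code.]

```c
#include <stddef.h>
#include <vulkan/vulkan.h>

/* Create a pipeline quickly by asking the driver to skip optimizations;
 * the caller is expected to compile an optimized version of the same
 * pipeline on a worker thread afterwards and swap it in. */
static VkPipeline create_pipeline_fast(VkDevice device,
                                       const VkGraphicsPipelineCreateInfo *base)
{
    VkGraphicsPipelineCreateInfo info = *base;
    info.flags |= VK_PIPELINE_CREATE_DISABLE_OPTIMIZATION_BIT;

    VkPipeline pipeline = VK_NULL_HANDLE;
    if (vkCreateGraphicsPipelines(device, VK_NULL_HANDLE, 1, &info,
                                  NULL, &pipeline) != VK_SUCCESS)
        return VK_NULL_HANDLE;
    return pipeline;
}
```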
Comment 11:

RADV has been using VK_PIPELINE_CREATE_DISABLE_OPTIMIZATION_BIT for a while now. Closing.