22:54 -!- Irssi: Starting query in freenode with ickle
22:55 From you on bugzilla:
22:55 Any new bo returned to userspace is zeroed. That data may entirely reside in the CPU cache, and may remain there across the GPU write into main memory. The clflush performed before the read to invalidate the cache may in fact just cause the writeback of the cache to main memory, overwriting the results.
22:55
22:55 If this is true then I think we have a semi-major security hole.
22:57 If the zeroing really does live in the CPU cache then it's possible for a client to create two BOs, blit from one to the other without updating the source BO's domain, and the destination BO will now contain a copy of the contents of the source BO's pages pre-zero.
22:57 We can't trust the client's domain tracking.
22:58 The only option is to either clflush or do a non-temporal zero operation to ensure that the physical pages are actually zeroed before we ever let a client touch them.
Day changed to 07 Jul 2017
02:14 if you let the kernel do its thing, then nope.
02:15 if you take over, then you need to clflush yourself; even if we added a new create ioctl to return a coherent buffer, there's still the old api hanging around
09:05 If we have any uapi that allows a client to scrape truly undefined contents from memory, it's a kernel bug.
09:06 It doesn't matter that userspace isn't "letting the kernel do its thing"
09:06 it doesn't.
09:06 if you use WC, the physical tags align with the cache
09:06 if you use GTT, first access forces the clflush
09:08 there are further problems with speculative CPU caching
09:10 the recent issue is ASYNC, and the choice is between never allowing explicit fencing or assuming the userspace driver is doing its part
09:10 I'm confused. If the reason why stuff started failing in Vulkan was that the kernel's zeroing of memory got flushed out *after* the GPU rendered to it, then what's to prevent the GPU from reading it *before* the zeroing gets flushed to main memory?
09:27 http://paste.debian.net/975349/
09:29 Ugh...
09:29 I'm fine with taking responsibility for synchronization of CPU and GPU.
09:30 What I'm not fine with is the kernel having an API that lets me read other processes' pages.
09:30 A simple clflush on BO alloc is sufficient.
09:31 but when? we either scrap ASYNC (either completely or add a check) or you do a set-domain on create and leave the kernel completely ignorant
09:32 Ok, maybe I don't understand what that patch does.
09:33 Maybe the right thing to do is just implicitly move to GTT whenever ASYNC is set.
09:33 But that seems a little weird
09:33 that's the normal mode, but we use an async task with a fence
09:34 which is where the fun with using ASYNC to ignore all those implicit fences begins
09:34 Ok, let me make sure I have this straight. The kernel does a CPU write to a user's BO in one of 3 circumstances: initial zeroing, relocations, and pwrite.
09:35 swapping
09:35 Ok, 4.
09:35 I hadn't thought about swapping but that's good to mention.
09:35 and relocations only if the bo is in the CPU domain
09:35 Right
09:36 pwrite is also only if in CPU domain
09:36 Otherwise it uses a GTT or WB map?
09:36 GTT
09:36 k
09:36 (plonking a page into the GTT is actually cheaper than creating a WC mapping for a page)
09:37 Next question: Under what circumstances does the BO domain change?
09:37 Can the kernel change it out from under me? Can some other userspace with which I've shared the buffer change it?
09:37 Also, what's the difference between GTT domain and RENDER/SAMPLER?
09:37 it's a single domain for all userspace, so shared buffers are a nightmare
09:38 the only time it can change from !userspace action is random swapping
09:38 Which changes it to CPU domain?
09:38 there's a small difference, in that GTT writes are not fully coherent
09:38 yes, swapping -> CPU
09:39 In what way are GTT writes not fully coherent?
09:39 the domains are effectively CPU (in the CPU cache), GTT (in chipset buffering), NONE
09:39 Ok
09:40 if you do a write through the GTT it takes a finite amount of time to land in the backing page, during which you can still read the old result
09:40 Makes sense.
09:40 (we used to assume that was impossible, but modern Atoms)
09:40 So if you write through the GTT and then try to read through the CPU you may race.
09:40 yes, and also possible via the GPU
09:41 but we make sure we always hammer before the execbuf
09:41 Ok
09:43 Given the above conversation, I think I'm now convinced that we can't just do a single set_domain because some other process may have moved stuff around or into the CPU domain.
09:43 Or the kernel may have swapped it
09:44 So we need to set the GTT domain every time we kick it off to the kernel.
09:44 true, it's definitely the safe thing to do, my goal is simply not doing anything we don't need inside the kernel
09:44 Agreed. The kernel should do as little as possible while still ensuring correctness.
09:45 One (somewhat terrible) option would be to just clflush on BO creation.
09:45 That would provide the needed safety guarantees and still let us avoid the tracking.
09:46 no, there's the swapping argument again
09:46 One day in the future, we can zero our own pages and just use a non-temporal zero instruction.
09:46 Oh...
09:46 userspace simply doesn't know when the kernel touched it in that scenario
09:46 Right, so the kernel either needs to do some implicit domain tracking or else use non-temporal moves to swap it in.
09:47 And I wouldn't be surprised if the latter is impossible in the kernel architecture.
09:47 nothing is impossible, but that would be very hard to get in
09:47 Right
09:47 Easier to just notice the swap-in and mark it as CPU domain?
09:49 In that case, I think the best thing to do is probably to take anything that's ASYNC and in the CPU domain and implicitly move it to the GTT domain.
09:49 And userspace will just have to be aware that it should always use GTT domain if it wants ASYNC.
09:49 Well, GTT, or WB.
09:49 But not CPU.
09:50 In other words, I think the fundamental problem is that ASYNC + GEM_DOMAIN_CPU don't mix.
09:51 yup, that's what we have to do
09:51 check everything on execbuf and cancel the async request if something has happened to the bo
09:51 And we (userspace) will just always set the domain to GTT and leave it there.
09:52 Which brings up one more question: How does this all work with async maps?
09:52 I think it's probably safe. Here's why:
09:54 1) If we map through the CPU cache and clflush like anv does, then the BO accidentally being in the CPU domain due to a swap is completely ok, because the worst that happens is that we (anv) clflush it on vkFlushMappedMemoryRanges and then the kernel clflushes it again to move it to the GTT domain.
09:54 2) If we map it through the GTT then the kernel will move it back to the GTT domain when it faults in the pages for the GTT read/write?
09:54 3) If we map WC, I have no idea...
09:55 WC should just work, because the CPU guys tell me that cachelines are tracked on physical tags
09:55 GTT is touch and go
09:55 first fault is fine
09:55 post-swap fault is fine, since we had to evict it from the GTT for swapping
09:56 so that is fine from the anv perspective
09:56 two clients, one reading from the GTT, the other writing via the CPU, may run into problems
09:57 (but equally will likely run into problems with set-domain being a one-sided barrier anyway)
09:57 Yeah
09:57 If two clients are trying to access it through GTT/CPU, you're in sketchy territory anyway
09:58 In that case, I would expect that the CPU-writing client needs to either clflush or move the BO to the GTT domain before the other client reads, or risk undefined results.
09:58 But there's not much we can do about that.
10:05 Ok, I think I'm convinced that your patch is correct, though it could probably use a slightly less passive-aggressive commit message. :-)
10:06 definitely, rewrote it in terms of swapping and its urgency is then apparent
10:06 Cool.
10:07 I'll get anv updated to just always use the GTT domain.
10:07 oh, for passing to relocs, just set RENDER
10:07 ?
10:07 I think we do check for !GPU_DOMAINS and complain
10:08 So I should put batches in RENDER domain?
10:08 or INSTRUCTION I guess would be more appropriate.
10:08 sure, we don't really care
10:08 But I shouldn't put it in GTT?
10:08 inside it boils down to a single bit of whether you write to the bo
10:09 if (unlikely((reloc->write_domain | reloc->read_domains)
10:09              & ~I915_GEM_GPU_DOMAINS)) {
10:09         return -EINVAL;
10:09 }
10:09 Gotcha
10:09 So should I just put everything in RENDER domain?
10:10 Or is GTT better for something?
10:10 there's a complication with gen6 pipecontrols
10:10 (do you care about those?)
10:10 For anv, I'm not liable to care about gen6 pipe controls. :-)
10:10 What's the complication?
10:10 they only use GGTT addresses even though we are using ppgtt
10:11 it's easiest to set execobject.flags |= EXEC_OBJECT_NEEDS_GTT
10:11 but the older method was to use reloc.write_domain = I915_GEM_DOMAIN_INSTRUCTION
10:11 Ah
10:12 Sounds like I should just set write_domain = RENDER for everything and call it a day.
10:13 Unless that will cause the kernel to do piles of extra flushing
10:14 no, we haven't done GPU cache tracking with those domains for ~10 years
10:14 we just flush everything at the end of the batch (and invalidate before every batch)
10:20 Reading a bit, it seems safer to just set read_domains
10:23 indeed, since you can rely on the kernel you use supporting EXEC_OBJECT_WRITE, it doesn't need reloc.write_domain at all
10:35 If I set NO_RELOC, will the kernel still walk the relocation list to gather up domains?
10:39 Wait, never mind...
10:39 As discussed above, the kernel will automatically move it to some GPU domain if ASYNC is set.
10:40 But what if ASYNC isn't set...
10:48 Hrm... I think I just managed to convince myself that userspace doesn't need to do anything with domains.
10:49 The two things that we need to set domains for, as far as I can tell, are implicit fences for X11 and relocations.
10:51 However, fences are taken care of by EXEC_OBJECT_WRITE
10:51 For relocations, vma_move_to_active automatically does obj->base.read_domains |= I915_GEM_GPU_DOMAINS;
10:52 so the only time the BO is in the CPU domain is if it's inactive, in which case the kernel will do the relocation in the CPU domain and then clflush as part of eb_move_to_gpu.
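
A minimal sketch of the "always move the BO to the GTT domain before the CPU touches it" conclusion from 09:44/10:07, using the I915_GEM_SET_DOMAIN ioctl. It assumes libdrm for drmIoctl(); the helper name and the fd/handle parameters are placeholders, not code taken from anv.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>
    #include <xf86drm.h>        /* drmIoctl() from libdrm */
    #include <drm/i915_drm.h>   /* DRM_IOCTL_I915_GEM_SET_DOMAIN */

    /* Move a GEM BO into the GTT domain before any CPU access, so the
     * kernel knows the BO is no longer relying on CPU-cache contents. */
    static int bo_set_gtt_domain(int fd, uint32_t handle, bool will_write)
    {
        struct drm_i915_gem_set_domain sd;

        memset(&sd, 0, sizeof(sd));
        sd.handle = handle;
        sd.read_domains = I915_GEM_DOMAIN_GTT;
        sd.write_domain = will_write ? I915_GEM_DOMAIN_GTT : 0;

        return drmIoctl(fd, DRM_IOCTL_I915_GEM_SET_DOMAIN, &sd);
    }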
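
A sketch of the relocation conclusion from 10:20-10:23: set only reloc.read_domains (RENDER), leave write_domain at 0, and signal GPU writes via EXEC_OBJECT_WRITE on the exec object (with EXEC_OBJECT_NEEDS_GTT covering the gen6 pipecontrol case separately). The helper name and parameters are illustrative, assuming a kernel new enough to support EXEC_OBJECT_WRITE.

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>
    #include <drm/i915_drm.h>

    /* Fill in one relocation entry and flag its target's exec object.
     * Only read_domains is set; whether the GPU writes the target is
     * communicated via EXEC_OBJECT_WRITE instead of reloc.write_domain. */
    static void fill_reloc(struct drm_i915_gem_relocation_entry *reloc,
                           struct drm_i915_gem_exec_object2 *target,
                           uint32_t target_handle,
                           uint64_t offset_in_batch,
                           uint64_t presumed_offset,
                           bool gpu_writes_target)
    {
        memset(reloc, 0, sizeof(*reloc));
        reloc->target_handle = target_handle;
        reloc->offset = offset_in_batch;    /* where the address lives in the batch */
        reloc->presumed_offset = presumed_offset;
        reloc->read_domains = I915_GEM_DOMAIN_RENDER;
        reloc->write_domain = 0;            /* not needed for GPU cache tracking */

        if (gpu_writes_target)
            target->flags |= EXEC_OBJECT_WRITE; /* drives the implicit write fence */
    }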
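
For the non-temporal zeroing aside at 22:58/09:46, a rough userspace illustration of the idea using SSE2 streaming stores; how the kernel would actually do this on swap-in is a separate question, and the buffer/length here are assumed to be 16-byte aligned.

    #include <emmintrin.h>  /* SSE2: _mm_stream_si128, _mm_sfence */
    #include <stddef.h>
    #include <stdint.h>

    /* Zero a 16-byte-aligned buffer with non-temporal (streaming) stores,
     * so the zeroes bypass the CPU cache and land directly in memory,
     * leaving no dirty cachelines to be written back later. */
    static void zero_nontemporal(void *buf, size_t len)
    {
        __m128i zero = _mm_setzero_si128();
        uint8_t *p = buf;

        for (size_t i = 0; i < len; i += 16)
            _mm_stream_si128((__m128i *)(p + i), zero);

        _mm_sfence();  /* order the streaming stores before later accesses */
    }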