This doesn't happen on the powerplay branch, but it does happen on Linus's tree at 4.4-rc5. As this appears related to the scheduler, I can go back to kernel 4.3 and test that, and if it doesn't happen there, try to bisect, if you think it's worthwhile.
Created attachment 120605: Screenshot of oops
I tried to bisect between v4.3 and HEAD, but too many other issues got in the way (i915, ath10k), and remoting in was too difficult when the screen wasn't showing anything.
0x24e7e is in amdgpu_vm_grab_id (include/linux/fence.h:292).

    287   * Returns true if f1 is chronologically later than f2. Both fences must be
    288   * from the same context, since a seqno is not re-used across contexts.
    289   */
    290  static inline bool fence_is_later(struct fence *f1, struct fence *f2)
    291  {
    292          if (WARN_ON(f1->context != f2->context))
    293                  return false;

This should just be a normal warning, not a bug.
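To make the check at fence.h:292 easier to follow, here is a minimal userspace sketch of the same idea. It is not the kernel code: `struct model_fence`, `model_fence_is_later` and `model_warn_count` are hypothetical names, the warning is just a counter plus a message on stderr, and the wraparound-tolerant ordering in the last line is an assumption of this sketch rather than a statement about what the 4.4-rc5 source does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified stand-in for the kernel's struct fence (assumption: only the
 * two fields that matter for the comparison are modelled). */
struct model_fence {
	unsigned int context; /* timeline the fence belongs to */
	uint32_t seqno;       /* position on that timeline */
};

/* Counts how often the cross-context check fired; stands in for WARN_ON(). */
static unsigned int model_warn_count;

static bool model_fence_is_later(const struct model_fence *f1,
				 const struct model_fence *f2)
{
	assert(f1 && f2);

	/* Sequence numbers are only comparable within one context. */
	if (f1->context != f2->context) {
		model_warn_count++;
		fprintf(stderr, "warning: comparing fences from contexts %u and %u\n",
			f1->context, f2->context);
		return false;
	}

	/* Wraparound-tolerant ordering; an assumption of this sketch. */
	return (int32_t)(f1->seqno - f2->seqno) > 0;
}
```

In this model, comparing two fences from different contexts only warns and returns false, which matches the reading that the check by itself is a warning path; it does not explain the freeze.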
When this happens my machine just freezes. The only way to continue is to press and hold the power button, but that doesn't cleanly unmount the disks.
(In reply to Mike Lothian from comment #4)
> When this happens my machine just freezes, the only way to continue is to
> press and hold the power button but that doesn't cleanly unmount the disks

Yeah, that is clearly a bug when the driver unloads.

Probably rather hard to reproduce, we should add a test case which loads and unloads the driver multiple times while there is load.
(In reply to Christian König from comment #5)
> (In reply to Mike Lothian from comment #4)
> > When this happens my machine just freezes, the only way to continue is to
> > press and hold the power button but that doesn't cleanly unmount the disks
>
> Yeah, that is clearly a bug when the driver unloads.
>
> Probably rather hard to reproduce, we should add a test case which loads and
> unloads the driver multiple times while there is load.

Maybe we should avoid using fences for VMID assignment and use an LRU list instead.
(In reply to david1.zhou@amd.com from comment #6)
> Maybe we shall avoid to use fence for vmid, instead using LRU list.

Yeah, I thought about that as well. The problem is that we used to have an LRU list, and I switched to fences because they had less overhead.

We still need to keep the fences around for synchronization, so I'm not sure that would really help.

The real question is: what is going wrong here?
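For readers unfamiliar with the LRU idea being discussed, the following is a rough userspace sketch of assigning a small pool of IDs in least-recently-used order. Everything in it is hypothetical (`MODEL_NUM_VMIDS`, `struct model_vmid_lru`, `model_vmid_grab`, the integer client handles); it is not how amdgpu does or did it, and it deliberately ignores the synchronization that, as noted above, would still need fences.

```c
#include <assert.h>

#define MODEL_NUM_VMIDS 4 /* hypothetical pool size for this sketch */

struct model_vmid_lru {
	/* order[0] is the least recently used ID, order[N-1] the most recent */
	unsigned int order[MODEL_NUM_VMIDS];
	/* owner[id] is the client currently holding that ID, 0 means none */
	unsigned int owner[MODEL_NUM_VMIDS];
};

static void model_vmid_init(struct model_vmid_lru *lru)
{
	for (unsigned int i = 0; i < MODEL_NUM_VMIDS; i++) {
		lru->order[i] = i;
		lru->owner[i] = 0;
	}
}

/* Move the entry at position pos to the most-recently-used end. */
static void model_vmid_touch(struct model_vmid_lru *lru, unsigned int pos)
{
	unsigned int id = lru->order[pos];

	for (unsigned int i = pos; i + 1 < MODEL_NUM_VMIDS; i++)
		lru->order[i] = lru->order[i + 1];
	lru->order[MODEL_NUM_VMIDS - 1] = id;
}

/* Return an ID for client (non-zero): reuse the one it already holds,
 * otherwise take over the least recently used ID. */
static unsigned int model_vmid_grab(struct model_vmid_lru *lru,
				    unsigned int client)
{
	unsigned int id;

	assert(lru && client != 0);

	for (unsigned int pos = 0; pos < MODEL_NUM_VMIDS; pos++) {
		id = lru->order[pos];
		if (lru->owner[id] == client) {
			model_vmid_touch(lru, pos);
			return id;
		}
	}

	id = lru->order[0];
	lru->owner[id] = client;
	model_vmid_touch(lru, 0);
	return id;
}
```

The sketch picks an ID without comparing any fences, so it would sidestep the cross-context comparison, but a real implementation would still have to wait for the previous user of an evicted ID to finish before reusing it.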
(In reply to Christian König from comment #7)
> (In reply to david1.zhou@amd.com from comment #6)
> > Maybe we shall avoid to use fence for vmid, instead using LRU list.
>
> Yeah, thought about that as well. The problem is that we used to have an LRU
> list and I switched to fences because they had less overhead.
>
> We still need to keep the fences around for synchronization, so I'm not sure
> if that would really help.
>
> The real price question is what is going wrong here?

Yes, we need to identify why the contexts of the two fences differ: where the two fences come from, what kind of fences they are, and which ring each one belongs to.
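One way to gather that information would be to record both fences whenever their contexts differ. The sketch below shows the shape of such a report in plain userspace C. The `kind` and `ring` fields and the `dbg_report_mismatch` helper are hypothetical, added only for illustration; the real struct fence does not necessarily carry them in this form.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical debugging view of a fence; not the kernel's struct fence. */
struct dbg_fence {
	unsigned int context; /* timeline the fence belongs to */
	unsigned int seqno;   /* position on that timeline */
	const char *kind;     /* hypothetical label for the fence type */
	int ring;             /* hypothetical ring index, -1 if unknown */
};

/* Fill buf with a description of both fences if their contexts differ.
 * Returns 1 on mismatch, 0 (and an empty string) otherwise. */
static int dbg_report_mismatch(const struct dbg_fence *f1,
			       const struct dbg_fence *f2,
			       char *buf, size_t len)
{
	assert(f1 && f2 && buf && len > 0);

	buf[0] = '\0';
	if (f1->context == f2->context)
		return 0;

	snprintf(buf, len,
		 "f1: ctx=%u seq=%u kind=%s ring=%d | f2: ctx=%u seq=%u kind=%s ring=%d",
		 f1->context, f1->seqno, f1->kind, f1->ring,
		 f2->context, f2->seqno, f2->kind, f2->ring);
	return 1;
}
```

Logging a line like this at the point of the warning would answer the three questions at once: origin (kind), context, and ring for each fence.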
Not seen this in a while now