Starting X and doing ordinary desktop work leads to a display freeze within minutes. Most of the time nothing shows up in syslog, but during bisecting I saw the following message at least once; it might be unrelated:

[drm:process_one_work] *ERROR* ring sdma1 timeout, last signaled seq=144370, last emitted seq=144370

Initially I got an RCU stall message:

INFO: rcu_sched detected stalls on CPUs/tasks:
 3-...: (13593 GPs behind) idle=561/140000000000000/0 softirq=0/0 fqs=14996
 5-...: (13593 GPs behind) idle=2e7/140000000000000/0 softirq=0/0 fqs=14996
 (detected by 0, t=15002 jiffies, g=13293, c=13292, q=0)
Task dump for CPU 3:
sdma1 R running task 0 499 2 0x00080008
 ffffffff810f83fb ffff8803e02de580 ffffffffa0333d01 ffff8800b9d1c110
 ffff8800b89a3c00 ffff8800b89a3cd8 0000000000000000 ffff8800b9d1c110
 ffffffffa036812b ffff8800b9fd58b8 ffffffff813c66bf ffff8800b9fd58b8
Call Trace:
 [<ffffffff810f83fb>] ? kmem_cache_free+0xab/0xc0
 [<ffffffffa0333d01>] ? amdgpu_sync_get_fence+0x51/0xc0 [amdgpu]
 [<ffffffffa036812b>] ? amdgpu_job_dependency+0x2b/0xb0 [amdgpu]
 [<ffffffff813c66bf>] ? _raw_spin_lock_irqsave+0x1f/0x30
 [<ffffffffa0367643>] ? amd_sched_main+0x1a3/0x3f0 [amdgpu]
 [<ffffffff81077310>] ? add_wait_queue+0x60/0x60
 [<ffffffffa03674a0>] ? amd_sched_process_job+0x70/0x70 [amdgpu]
 [<ffffffff8105fbbc>] ? kthread+0xbc/0xe0
 [<ffffffff813c6b02>] ? ret_from_fork+0x22/0x40
 [<ffffffff8105fb00>] ? kthread_stop+0x70/0x70
Task dump for CPU 5:
kworker/5:1 R running task 0 143 2 0x00080008
Workqueue: events amd_sched_job_finish [amdgpu]
 ffff88043ed57900 0000000000000000 ffffffff81059ae2 0000000000000018
 ffff88042b4e0000 ffff88042b982d00 ffff88043ed53420 ffff88042b982d00
 ffff88042b982d00 ffff88043ed53400 ffff8800ba673d80 ffff8800ba673db0
Call Trace:
 [<ffffffff81059ae2>] ? process_one_work+0x132/0x350
 [<ffffffff8105b3fe>] ? worker_thread+0x11e/0x430
 [<ffffffff8105b2e0>] ? create_worker+0x180/0x180
 [<ffffffff8105fbbc>] ? kthread+0xbc/0xe0
 [<ffffffff813c6b02>] ? ret_from_fork+0x22/0x40
 [<ffffffff8105fb00>] ? kthread_stop+0x70/0x70

During bisecting I probably did not wait long enough for this to show up (the RCU stall timeout is apparently configured for one minute).

According to git bisect:

8df07daf3952b7606e2d17076198ec3fb38ab1f1 is the first bad commit
commit 8df07daf3952b7606e2d17076198ec3fb38ab1f1
Date: Thu May 19 09:54:15 2016 +0200

    drm/amdgpu: fix and cleanup job destruction

Setup:
kernel: agd5f/drm-next-4.8-wip
mesa: git 65c2abf6fdd51b0a80a72caa0c52cf3f4578e743
llvm: git ef1f2996c17c9b1480201239002b58851810e8fc
xf86-video-amdgpu: git 60ced5026ebc34d9f32c7618430b6a7ef7c8eb4b
Xorg: 1.18.0
mplayer: svn r37870
GPU: Gigabyte 380 (Tonga)
Yeah, we stumbled over that problem internally as well and are already working on it.
Created attachment 124493: avoid schedule() during spinlock

Hi Csaba! The attached patch doesn't fix the problem for me, but it seems correct and at least changes the symptoms. Maybe it helps on your system?
Created attachment 124495: Possible fix

Complete fix for the issue; thanks to Nicolai for pointing me in the right direction.
This patch does the trick. I've run my stress test for about an hour, so it's safe to say that it's fixed; feel free to add my Tested-by.