96445 – [amdgpu][tonga] display freezes soon after X start

Status:	RESOLVED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu
Version:	DRI git
Hardware:	Other All

Importance:	medium normal
Assignee:	Default DRI bug account

Reported:	2016-06-08 22:04 UTC by csaba.halasz
Modified:	2016-06-14 08:11 UTC
CC List:	1 user

Attachments
avoid schedule() during spinlock (1.63 KB, patch) 2016-06-13 07:01 UTC, Nicolai Hähnle
Possible fix (1.18 KB, patch) 2016-06-13 08:08 UTC, Christian König

Description csaba.halasz 2016-06-08 22:04:39 UTC

Starting X and doing ordinary desktop work leads to a display freeze within minutes. Most of the time nothing shows up in syslog, but while bisecting I saw this message at least once; it might be unrelated:

[drm:process_one_work] *ERROR* ring sdma1 timeout, last signaled seq=144370, last emitted seq=144370
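
For readers comparing several such log lines, here is a minimal Python sketch that pulls the ring name and the two sequence numbers out of the message and reports how many emitted fences had not yet signaled. The regular expression is an assumption based only on the single line quoted above; other kernel versions may format the message differently, and `parse_ring_timeout` is a hypothetical helper name, not part of any tool.

```python
import re

# Pattern assumed from the one log line quoted in this report.
RING_TIMEOUT = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"last signaled seq=(?P<signaled>\d+), "
    r"last emitted seq=(?P<emitted>\d+)"
)


def parse_ring_timeout(line):
    """Return (ring, signaled, emitted, outstanding), or None if the line does not match."""
    m = RING_TIMEOUT.search(line)
    if m is None:
        return None
    signaled = int(m.group("signaled"))
    emitted = int(m.group("emitted"))
    # outstanding = sequence numbers emitted but not yet signaled
    return m.group("ring"), signaled, emitted, emitted - signaled


line = ("[drm:process_one_work] *ERROR* ring sdma1 timeout, "
        "last signaled seq=144370, last emitted seq=144370")
print(parse_ring_timeout(line))  # ('sdma1', 144370, 144370, 0)
```

Both sequence numbers in the quoted message are equal, so the difference is 0: by the time the line was printed, the last emitted sequence number on sdma1 had also been signaled. On its own, then, this line does not show work stuck on the ring.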

Initially I got an RCU stall message:

 INFO: rcu_sched detected stalls on CPUs/tasks:
 	3-...: (13593 GPs behind) idle=561/140000000000000/0 softirq=0/0 fqs=14996 
 	5-...: (13593 GPs behind) idle=2e7/140000000000000/0 softirq=0/0 fqs=14996 
 	(detected by 0, t=15002 jiffies, g=13293, c=13292, q=0)
 Task dump for CPU 3:
 sdma1           R  running task        0   499      2 0x00080008
  ffffffff810f83fb ffff8803e02de580 ffffffffa0333d01 ffff8800b9d1c110
  ffff8800b89a3c00 ffff8800b89a3cd8 0000000000000000 ffff8800b9d1c110
  ffffffffa036812b ffff8800b9fd58b8 ffffffff813c66bf ffff8800b9fd58b8
 Call Trace:
  [<ffffffff810f83fb>] ? kmem_cache_free+0xab/0xc0
  [<ffffffffa0333d01>] ? amdgpu_sync_get_fence+0x51/0xc0 [amdgpu]
  [<ffffffffa036812b>] ? amdgpu_job_dependency+0x2b/0xb0 [amdgpu]
  [<ffffffff813c66bf>] ? _raw_spin_lock_irqsave+0x1f/0x30
  [<ffffffffa0367643>] ? amd_sched_main+0x1a3/0x3f0 [amdgpu]
  [<ffffffff81077310>] ? add_wait_queue+0x60/0x60
  [<ffffffffa03674a0>] ? amd_sched_process_job+0x70/0x70 [amdgpu]
  [<ffffffff8105fbbc>] ? kthread+0xbc/0xe0
  [<ffffffff813c6b02>] ? ret_from_fork+0x22/0x40
  [<ffffffff8105fb00>] ? kthread_stop+0x70/0x70
 Task dump for CPU 5:
 kworker/5:1     R  running task        0   143      2 0x00080008
 Workqueue: events amd_sched_job_finish [amdgpu]
  ffff88043ed57900 0000000000000000 ffffffff81059ae2 0000000000000018
  ffff88042b4e0000 ffff88042b982d00 ffff88043ed53420 ffff88042b982d00
  ffff88042b982d00 ffff88043ed53400 ffff8800ba673d80 ffff8800ba673db0
 Call Trace:
  [<ffffffff81059ae2>] ? process_one_work+0x132/0x350
  [<ffffffff8105b3fe>] ? worker_thread+0x11e/0x430
  [<ffffffff8105b2e0>] ? create_worker+0x180/0x180
  [<ffffffff8105fbbc>] ? kthread+0xbc/0xe0
  [<ffffffff813c6b02>] ? ret_from_fork+0x22/0x40
  [<ffffffff8105fb00>] ? kthread_stop+0x70/0x70

While bisecting I probably did not wait long enough for this to show up (apparently the stall detection is configured for 1 minute).
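
The "1 minute" figure can be cross-checked against the `t=15002 jiffies` value in the stall report. The sketch below converts jiffies to seconds for a few candidate tick rates. The report does not state the kernel's `CONFIG_HZ`, so the values tried here (100, 250, 300, 1000) are assumptions, and `jiffies_to_seconds` is a hypothetical helper.

```python
def jiffies_to_seconds(jiffies, hz):
    """Convert a jiffies count to seconds for a given tick rate (ticks per second)."""
    return jiffies / hz


STALL_JIFFIES = 15002  # from "t=15002 jiffies" in the stall report above

# Candidate CONFIG_HZ values; the report does not say which one was in use.
for hz in (100, 250, 300, 1000):
    print(hz, round(jiffies_to_seconds(STALL_JIFFIES, hz), 1))
```

If the kernel was built with a tick rate of 250, 15002 jiffies is about 60 seconds, which would match the one-minute delay mentioned above. With the other candidate rates the same count works out to roughly 150, 50, or 15 seconds.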
According to git bisect:

8df07daf3952b7606e2d17076198ec3fb38ab1f1 is the first bad commit
commit 8df07daf3952b7606e2d17076198ec3fb38ab1f1
Date:   Thu May 19 09:54:15 2016 +0200

    drm/amdgpu: fix and cleanup job destruction


kernel agd5f/drm-next-4.8-wip
mesa git 65c2abf6fdd51b0a80a72caa0c52cf3f4578e743
llvm git ef1f2996c17c9b1480201239002b58851810e8fc
xf86-video-amdgpu git 60ced5026ebc34d9f32c7618430b6a7ef7c8eb4b
Xorg 1.18.0
mplayer svn r37870
gigabyte 380 (tonga)

Comment 1 Christian König 2016-06-09 13:35:56 UTC

Yeah, we stumbled over that problem internally as well and are already working on it.

Comment 2 Nicolai Hähnle 2016-06-13 07:01:58 UTC

Created attachment 124493
avoid schedule() during spinlock

Hi Csaba! The attached patch doesn't fix the problem for me, but it seems correct and at least changes the symptoms. Maybe it helps on your system?

Comment 3 Christian König 2016-06-13 08:08:38 UTC

Created attachment 124495
Possible fix

Complete fix for the issue, thanks to Nicolai for pointing me in the right direction.

Comment 4 Nicolai Hähnle 2016-06-13 14:05:30 UTC

This patch does the trick. I've run my stress test for about an hour, so it's safe to say that it's fixed. Feel free to add my Tested-by.
