Bug 100129

Summary: [CTG][BAT]gem_exec_suspend@basic-s3 incomplete on C
Product: DRI Reporter: Jani Saarinen <jani.saarinen>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED WORKSFORME QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: intel-gfx-bugs, tomeu
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:

Comment 1 Chris Wilson 2017-03-16 14:32:20 UTC
[  254.327753] ===============================
[  254.327753] [ ERR: suspicious RCU usage.  ]
[  254.327753] 4.10.0-CI-CI_DRM_2254+ #1 Not tainted
[  254.327753] -------------------------------
[  254.327753] kernel/workqueue.c:712 sched RCU or wq_pool_mutex should be held!
[  254.327753]
               other info that might help us debug this:

[  254.327753]
               RCU used illegally from offline CPU!
               rcu_scheduler_active = 2, debug_locks = 0
[  254.327753] no locks held by swapper/1/0.
[  254.327753]
               stack backtrace:
[  254.327753] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.10.0-CI-CI_DRM_2254+ #1
[  254.327753] Hardware name: LENOVO 7465CTO/7465CTO, BIOS 6DET44WW (2.08 ) 04/22/2009
[  254.327753] Call Trace:
[  254.327753]  dump_stack+0x67/0x92
[  254.327753]  lockdep_rcu_suspicious+0xd7/0x110
[  254.327753]  get_work_pool+0xb1/0xc0
[  254.327753]  __queue_work+0x89/0x620
[  254.327753]  queue_work_on+0x33/0x70
[  254.327753]  clear_sched_clock_stable+0x33/0x40
[  254.327753]  early_init_intel+0x10a/0x310
[  254.327753]  init_intel+0x11/0x310
[  254.327753]  identify_cpu+0x309/0x520
[  254.327753]  identify_secondary_cpu+0x18/0x80
[  254.327753]  smp_store_cpu_info+0x3a/0x40
[  254.327753]  start_secondary+0x53/0x120
[  254.327753]  start_cpu+0x14/0x14
[  254.337007] BUG: using smp_processor_id() in preemptible [00000000] code: kworker/1:2/105
[  254.337007] caller is debug_smp_processor_id+0x17/0x20
[  254.337007] CPU: 0 PID: 105 Comm: kworker/1:2 Not tainted 4.10.0-CI-CI_DRM_2254+ #1
[  254.337007] Disabled fast string operations
[  254.337007] Hardware name: LENOVO 7465CTO/7465CTO, BIOS 6DET44WW (2.08 ) 04/22/2009
[  254.337007] Workqueue: events __clear_sched_clock_stable
[  254.337007] Call Trace:
[  254.337007]  dump_stack+0x67/0x92
[  254.337007]  check_preemption_disabled+0xd7/0xe0
[  254.337007]  debug_smp_processor_id+0x17/0x20
[  254.337007]  __clear_sched_clock_stable+0x11/0x60
[  254.337007]  process_one_work+0x1f4/0x6d0
[  254.337007]  ? process_one_work+0x16e/0x6d0
[  254.337007]  worker_thread+0x49/0x4a0
[  254.337007]  kthread+0x107/0x140
[  254.337007]  ? process_one_work+0x6d0/0x6d0
[  254.337007]  ? kthread_create_on_node+0x40/0x40
[  254.337007]  ret_from_fork+0x2e/0x40
[  254.337007] sched_clock: Marking unstable (254336007317, -321214989128)<-(-66878981811, 0)
[  254.758282]  cache: parent cpu1 should not be sleeping
Comment 2 Chris Wilson 2017-03-16 14:36:30 UTC
Date: Wed, 8 Mar 2017 14:16:56 -0800
From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
Subject: RCU used on incoming CPU before rcu_cpu_starting() called
Message-Id: <20170308221656.GA11949@linux.vnet.ibm.com>
Comment 3 Tomeu Vizoso 2017-03-21 09:22:13 UTC
(In reply to Chris Wilson from comment #2)
> Date: Wed, 8 Mar 2017 14:16:56 -0800
> From: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>
> Subject: RCU used on incoming CPU before rcu_cpu_starting() called
> Message-Id: <20170308221656.GA11949@linux.vnet.ibm.com>

According to <20170320125000.GG3637@linux.vnet.ibm.com>, the commit below (present in 4.11-rc2) should have fixed this condition:

   f94c8d116997 ("sched/clock, x86/tsc: Rework the x86 'unstable' sched_clock() interface")
Comment 4 Chris Wilson 2017-03-21 09:41:21 UTC
Until it happens again...
Comment 5 Martin Peres 2017-05-15 07:49:54 UTC
(In reply to Chris Wilson from comment #4)
> Until it happens again...

It did: https://intel-gfx-ci.01.org/CI/CI_DRM_2610/fi-ctg-l9400/igt@gem_exec_suspend@basic-s3.html
Comment 6 Martin Peres 2017-05-15 13:28:03 UTC
(In reply to Martin Peres from comment #5)
> (In reply to Chris Wilson from comment #4)
> > Until it happens again...
> 
> It did:
> https://intel-gfx-ci.01.org/CI/CI_DRM_2610/fi-ctg-l9400/
> igt@gem_exec_suspend@basic-s3.html

In the last 6 runs, it failed 5 times.
Comment 7 Mika Kuoppala 2017-05-19 10:50:39 UTC
gem_exec_suspend@basic-s3 incomplete on fi-skl-6260u.

https://intel-gfx-ci.01.org/CI/Patchwork_4753/fi-skl-6260u/igt@gem_exec_suspend@basic-s3.html
Comment 8 Mika Kuoppala 2017-05-19 10:52:07 UTC
Forgot to add that the patch in question does not touch skl codepaths, so it is in CI_DRM_2635.
Comment 9 Mika Kuoppala 2017-05-19 11:52:23 UTC
As Martin noted, very likely that skl is a separate. So made a new bug
to track skl.

https://bugs.freedesktop.org/show_bug.cgi?id=101098
Comment 10 Martin Peres 2017-05-30 14:29:37 UTC
So, we found that this particular machine had a very weird clock drift when suspended. We swapped this machine for another one which does not show this behaviour. I'll re-open if the problem re-appears.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.