Starting with the 4.13-rc1 back-merge in drm-tip, the machine fi-elk-e7500 started hard-hanging when running the test igt@gem_exec_suspend@basic-s3. Nothing interesting in the logs. Full logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2845/fi-elk-e7500/igt@gem_exec_suspend@basic-s3.html
Hmm, I wonder if we randomize the test order and then see if it is still the same suspend that kills it.
(In reply to Chris Wilson from comment #1)
> Hmm, I wonder if we randomize the test order and then see if it is still the
> same suspend that kills it.

I guess it would be a better idea to just check if running the test alone is enough to replicate the issue.
Pick something and automate it. ;)
Bisected with elk-e7500, got reasonable looking result:

bf22ff45bed664aefb5c4e43029057a199b7070c is the first bad commit
commit bf22ff45bed664aefb5c4e43029057a199b7070c
Author: Jeffy Chen <jeffy.chen@rock-chips.com>
Date:   Mon Jun 26 19:33:34 2017 +0800

    genirq: Avoid unnecessary low level irq function calls

    Check irq state in enable/disable/unmask/mask_irq to avoid unnecessary
    low level irq function calls.

    This has two advantages:
        - Conditionals are faster than hardware access
        - Solves issues with the underlying refcounting of the pinctrl
          infrastructure

    Suggested-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: tfiga@chromium.org
    Cc: briannorris@chromium.org
    Cc: dianders@chromium.org
    Link: http://lkml.kernel.org/r/1498476814-12563-2-git-send-email-jeffy.chen@rock-chips.com

:040000 040000 ec5072725f8be0a3906e949aa0172cb3e00729d6 27847e81e1c424a62938404fd48bea3c439d74c0 M l
Hardware: HP Compaq 8000 Elite
Testcase: igt@gem_exec_suspend@basic-s3

Hard hang with no serial output, no panic-reboot. Double-checked this result, and the commit before it is OK: d829b8fb2431595422289cfc210f0a955a8bec74
I have the same thing on an old cheap laptop.

model name : Intel(R) Core(TM)2 CPU
cpu family : 6
model : 15

Previously, this never happened on the kernel series 4.{10,11,12}. After reverting commit bf22ff45bed664 on 4.13-rc2, it unfortunately still hangs when it wakes up. It is difficult to diagnose because it does not happen right away, only after some time of activity.
(In reply to latalante from comment #6)
> I have the same thing on an old cheap laptop.
> model name : Intel(R) Core(TM)2 CPU
> cpu family : 6
> model : 15
>
> Previously, this never happened on the kernel series 4.{10,11,12}.
> After reverted commit bf22ff45bed664 on 4.13-rc2, unfortunately it still
> hangs when it wakes up.
> Difficult to diagnose because it does not happen right away. After some time
> of action.

Thanks, I communicated that to the developers. I hope it helps.
(In reply to Martin Peres from comment #7)
> Thanks, I communicated that to the developers. I hope it helps.

Sorry for the misleading report. In my case, the hang after resuming from sleep is related to mq-deadline/ext4 I/O hangs:
https://lkml.org/lkml/2017/6/25/248

I have no idea why the default CFQ scheduler disappeared, even though the configuration has:
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_CFQ=y
CONFIG_DEFAULT_IOSCHED="cfq"

4.12.3
cat /sys/block/sda/queue/scheduler
[cfq]

4.13-rc2
cat /sys/block/sda/queue/scheduler
[mq-deadline] bfq none

After
echo none > /sys/block/sda/queue/scheduler
I did not notice any hang on kernel 4.13-rc2 (when coming out of sleep).
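The scheduler check in the comment above can be scripted when comparing kernels. The following is only a minimal sketch: it assumes the sysfs format shown in that comment, where the active scheduler is the bracketed entry, and the helper names `parse_scheduler` and `read_scheduler` are made up for illustration.

```python
from pathlib import Path


def parse_scheduler(text):
    """Split the contents of a queue/scheduler file into (active, available).

    The active scheduler is the entry wrapped in square brackets,
    e.g. "[mq-deadline] bfq none".
    """
    active = None
    available = []
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


def read_scheduler(device="sda"):
    """Read the scheduler of one block device (device name is an assumption)."""
    path = Path("/sys/block") / device / "queue" / "scheduler"
    return parse_scheduler(path.read_text())


if __name__ == "__main__":
    # The two strings reported in the comment above.
    print(parse_scheduler("[cfq]"))
    print(parse_scheduler("[mq-deadline] bfq none"))
```

Running `read_scheduler("sda")` on both kernels would show directly whether the default scheduler changed between them.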
(In reply to latalante from comment #8)
> (In reply to Martin Peres from comment #7)
> > Thanks, I communicated that to the developers. I hope it helps.
>
> Sorry for the unnecessary misleading.
> In my case, hanging out after sleep is related to - mq-deadline/ext4 IO
> hangs.
> https://lkml.org/lkml/2017/6/25/248
>
> I have no idea why he disappeared default CFQ scheduler, though in the
> configuration it is:
> CONFIG_IOSCHED_CFQ=y
> CONFIG_DEFAULT_CFQ=y
> CONFIG_DEFAULT_IOSCHED="cfq"
>
> 4.12.3
> cat /sys/block/sda/queue/scheduler
> [cfq]
>
> 4.13-rc2
> cat /sys/block/sda/queue/scheduler
> [mq-deadline] bfq none
>
> After
> echo none > /sys/block/sda/queue/scheduler
> I did not noticed any hang on the kernel 4.13-rc2 (out after sleep).

Thanks for the info!

The following patch fixes the issue for us:
https://lkml.org/lkml/2017/7/27/653
(In reply to Martin Peres from comment #9)
> Thanks for the info!
>
> The following patch fixes the issue for us:
> https://lkml.org/lkml/2017/7/27/653

It seems that everything is already on the way to a solution.
https://lkml.org/lkml/2017/7/30/128

I also learned how to continue to use CFQ on blk-mq.
scsi_mod.use_blk_mq=0
(In reply to latalante from comment #10)
> I also learned how to continue to use CFQ on blk-mq.

Of course, that was supposed to say: the I/O schedulers cfq, noop, and deadline for SCSI.
commit 4b1cd3afb1c7e918c3e0748dfd4ecb6e43a41573
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Mon Jul 31 22:07:09 2017 +0200

    x86/hpet: Cure interface abuse in the resume path

    The HPET resume path abuses irq_domain_[de]activate_irq() to restore the
    MSI message in the HPET chip for the boot CPU on resume and it relies on an
    implementation detail of the interrupt core code, which magically makes the
    HPET unmask call invoked via a irq_disable/enable pair. This worked as long
    as the irq code did unconditionally invoke the unmask() callback. With the
    recent changes which keep track of the masked state to avoid expensive
    hardware access, this does not longer work. As a consequence the HPET timer
    interrupts are not unmasked which breaks resume as the boot CPU waits
    forever that a timer interrupt arrives.

    Make the restore of the MSI message explicit and invoke the unmask()
    function directly. While at it get rid of the pointless affinity setting as
    nothing can change the affinity of the interrupt and the vector across
    suspend/resume. The restore of the MSI message reestablishes the previous
    affinity setting which is the correct one.

    Fixes: bf22ff45bed6 ("genirq: Avoid unnecessary low level irq function calls")
    Reported-by: Martin Peres <martin.peres@linux.intel.com>
    Reported-by: Tomi Sarvela <tomi.p.sarvela@intel.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: jeffy.chen@rock-chips.com
    Cc: Marc Zyngier <marc.zyngier@arm.com>
    Cc: Peter Ziljstra <peterz@infradead.org>
    Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
    Tested-by: Tomi Sarvela <tomi.p.sarvela@intel.com>

Applied to core-for-CI and presumed going upstream urgently.
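To make the failure mode in the commit message easier to follow, here is a toy Python model of it. It is not kernel code: every class and function name is hypothetical, and it assumes only what the message states, namely that the resume path relied on a disable/enable pair to trigger unmask(), and that the core now skips unmask() when its cached state says the interrupt is already unmasked.

```python
class ToyTimerHardware:
    """Hypothetical stand-in for the timer's interrupt mask bit."""

    def __init__(self):
        self.masked = True

    def lose_state_across_suspend(self):
        # Assumption: the chip comes back masked after suspend.
        self.masked = True


class ToyIrqCore:
    """Toy interrupt core with optional tracking of the masked state."""

    def __init__(self, hw, track_masked_state):
        self.hw = hw
        self.track_masked_state = track_masked_state
        self.sw_masked = True  # the core's cached view of the mask bit

    def unmask(self):
        if self.track_masked_state and not self.sw_masked:
            return  # cached state says "already unmasked": skip hardware
        self.hw.masked = False
        self.sw_masked = False

    def disable(self):
        pass  # modeled as a lazy disable: no hardware access

    def enable(self):
        self.unmask()


def resume_with_disable_enable_pair(core):
    # Old resume path: relies on enable() reaching the hardware unmask.
    core.disable()
    core.enable()


def resume_with_explicit_unmask(core):
    # Fixed resume path: write the hardware directly, bypassing the cache.
    core.hw.masked = False
    core.sw_masked = False


def timer_works_after_resume(track_masked_state, resume):
    hw = ToyTimerHardware()
    core = ToyIrqCore(hw, track_masked_state)
    core.enable()                   # normal boot-time setup
    hw.lose_state_across_suspend()  # the core's cache is now stale
    resume(core)
    return not hw.masked
```

In this model the disable/enable pair works only while unmask() is unconditional. Once the cached state is consulted, the stale cache hides the fact that the hardware is masked again, which corresponds to the boot CPU waiting forever for a timer interrupt.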
Thanks Chris. I was a little asleep at the wheel this morning it would seem :)