101838 – [BAT][ELK] The machine fails to resume after 4.13-rc1

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	Other All

Importance:	high critical
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2017-07-19 11:07 UTC by Martin Peres
Modified:	2017-08-01 09:54 UTC
CC List:	1 user

See Also:
i915 platform:	G45
i915 features:	power/suspend-resume

Description Martin Peres 2017-07-19 11:07:46 UTC

Starting with the 4.13-rc1 back-merge in drm-tip, the machine fi-elk-e7500 started hard-hanging when running the test igt@gem_exec_suspend@basic-s3.

Nothing interesting in the logs.

Full logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2845/fi-elk-e7500/igt@gem_exec_suspend@basic-s3.html

Comment 1 Chris Wilson 2017-07-19 12:27:35 UTC

Hmm, I wonder if we randomize the test order and then see if it is still the same suspend that kills it.

Comment 2 Martin Peres 2017-07-19 14:42:24 UTC

(In reply to Chris Wilson from comment #1)
> Hmm, I wonder if we randomize the test order and then see if it is still the
> same suspend that kills it.

I guess it would be a better idea to just check if running the test alone is enough to replicate the issue.

Comment 3 Chris Wilson 2017-07-19 14:57:19 UTC

Pick something and automate it. ;)

Comment 4 Tomi Sarvela 2017-07-21 10:26:48 UTC

Bisected with elk-e7500, got reasonable looking result:

bf22ff45bed664aefb5c4e43029057a199b7070c is the first bad commit
commit bf22ff45bed664aefb5c4e43029057a199b7070c
Author: Jeffy Chen <jeffy.chen@rock-chips.com>
Date:   Mon Jun 26 19:33:34 2017 +0800

    genirq: Avoid unnecessary low level irq function calls
    
    Check irq state in enable/disable/unmask/mask_irq to avoid unnecessary
    low level irq function calls.
    
    This has two advantages:
        - Conditionals are faster than hardware access
    
        - Solves issues with the underlying refcounting of the pinctrl
          infrastructure
    
    Suggested-by: Thomas Gleixner <tglx@linutronix.de>
    Signed-off-by: Jeffy Chen <jeffy.chen@rock-chips.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: tfiga@chromium.org
    Cc: briannorris@chromium.org
    Cc: dianders@chromium.org
    Link: http://lkml.kernel.org/r/1498476814-12563-2-git-send-email-jeffy.chen@rock-chips.com

:040000 040000 ec5072725f8be0a3906e949aa0172cb3e00729d6 27847e81e1c424a62938404fd48bea3c439d74c0 M    l

Comment 5 Tomi Sarvela 2017-07-21 10:35:18 UTC

Hardware: HP Compaq 8000 Elite
Testcase: igt@gem_exec_suspend@basic-s3

Hard hang with no serial output, no panic-reboot.

Double-checked this; the commit immediately before it is OK:
d829b8fb2431595422289cfc210f0a955a8bec74

Comment 6 latalante 2017-07-25 08:23:22 UTC

I have the same thing on an old cheap laptop.
model name      : Intel(R) Core(TM)2 CPU
cpu family      : 6
model           : 15

Previously, this never happened on the 4.{10,11,12} kernel series.
After reverting commit bf22ff45bed664 on 4.13-rc2, it unfortunately still hangs when it wakes up.
It is difficult to diagnose because it does not happen right away, only after some time of activity.

Comment 7 Martin Peres 2017-07-26 13:47:54 UTC

(In reply to latalante from comment #6)
> I have the same thing on an old cheap laptop.
> model name      : Intel(R) Core(TM)2 CPU
> cpu family      : 6
> model           : 15
> 
> Previously, this never happened on the 4.{10,11,12} kernel series.
> After reverting commit bf22ff45bed664 on 4.13-rc2, it unfortunately still
> hangs when it wakes up.
> It is difficult to diagnose because it does not happen right away, only
> after some time of activity.

Thanks, I communicated that to the developers. I hope it helps.

Comment 8 latalante 2017-07-27 18:31:24 UTC

(In reply to Martin Peres from comment #7)
> Thanks, I communicated that to the developers. I hope it helps.

Sorry for the misleading report.
In my case, the hang after resuming from sleep is related to the mq-deadline/ext4 IO hangs:
https://lkml.org/lkml/2017/6/25/248

I have no idea why the default CFQ scheduler disappeared, even though it is set in the configuration:
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_CFQ=y
CONFIG_DEFAULT_IOSCHED="cfq"

4.12.3
cat /sys/block/sda/queue/scheduler
[cfq]

4.13-rc2
cat /sys/block/sda/queue/scheduler
[mq-deadline] bfq none

After
echo none > /sys/block/sda/queue/scheduler
I did not notice any hang on kernel 4.13-rc2 after resuming from sleep.
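
The scheduler file shown above lists the available I/O schedulers on one line, with the active one in square brackets. A minimal sketch of reading that format follows; `parse_scheduler` is a hypothetical helper written for illustration, not part of any kernel tool, and the sample lines are the ones quoted in this comment.

```python
def parse_scheduler(line):
    """Split a /sys/block/<dev>/queue/scheduler line into (active, available).

    Hypothetical helper for illustration only.
    """
    active = None
    available = []
    for token in line.split():
        # The active scheduler is the token wrapped in square brackets.
        if token.startswith("[") and token.endswith("]"):
            token = token[1:-1]
            active = token
        available.append(token)
    return active, available


# Sample lines copied from the 4.12.3 and 4.13-rc2 output above.
print(parse_scheduler("[cfq]"))
print(parse_scheduler("[mq-deadline] bfq none"))
```

On a live system the same helper could be fed the contents of /sys/block/sda/queue/scheduler to check which scheduler is active before and after a suspend test.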

Comment 9 Martin Peres 2017-07-28 14:59:43 UTC

(In reply to latalante from comment #8)
> (In reply to Martin Peres from comment #7)
> > Thanks, I communicated that to the developers. I hope it helps.
> 
> Sorry for the misleading report.
> In my case, the hang after resuming from sleep is related to the
> mq-deadline/ext4 IO hangs:
> https://lkml.org/lkml/2017/6/25/248
> 
> I have no idea why the default CFQ scheduler disappeared, even though it is
> set in the configuration:
> CONFIG_IOSCHED_CFQ=y
> CONFIG_DEFAULT_CFQ=y
> CONFIG_DEFAULT_IOSCHED="cfq"
> 
> 4.12.3
> cat /sys/block/sda/queue/scheduler
> [cfq]
> 
> 4.13-rc2
> cat /sys/block/sda/queue/scheduler
> [mq-deadline] bfq none
> 
> After
> echo none > /sys/block/sda/queue/scheduler
> I did not notice any hang on kernel 4.13-rc2 after resuming from sleep.

Thanks for the info!

The following patch fixes the issue for us: https://lkml.org/lkml/2017/7/27/653

Comment 10 latalante 2017-07-31 08:40:21 UTC

(In reply to Martin Peres from comment #9)
> Thanks for the info!
> 
> The following patch fixes the issue for us:
> https://lkml.org/lkml/2017/7/27/653

It seems that everything is already on the way to a solution.
https://lkml.org/lkml/2017/7/30/128

I also learned how to continue to use CFQ on blk-mq.
scsi_mod.use_blk_mq=0

Comment 11 latalante 2017-07-31 12:19:00 UTC

(In reply to latalante from comment #10)
> I also learned how to continue to use CFQ on blk-mq.

Of course, that was supposed to say: the I/O schedulers cfq, noop, and deadline for SCSI.

Comment 12 Chris Wilson 2017-08-01 09:52:00 UTC

commit 4b1cd3afb1c7e918c3e0748dfd4ecb6e43a41573
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Mon Jul 31 22:07:09 2017 +0200

    x86/hpet: Cure interface abuse in the resume path
    
    The HPET resume path abuses irq_domain_[de]activate_irq() to restore the
    MSI message in the HPET chip for the boot CPU on resume and it relies on an
    implementation detail of the interrupt core code, which magically makes the
    HPET unmask call invoked via a irq_disable/enable pair. This worked as long
    as the irq code did unconditionally invoke the unmask() callback. With the
    recent changes which keep track of the masked state to avoid expensive
    hardware access, this does not longer work. As a consequence the HPET timer
    interrupts are not unmasked which breaks resume as the boot CPU waits
    forever that a timer interrupt arrives.
    
    Make the restore of the MSI message explicit and invoke the unmask()
    function directly. While at it get rid of the pointless affinity setting as
    nothing can change the affinity of the interrupt and the vector across
    suspend/resume. The restore of the MSI message reestablishes the previous
    affinity setting which is the correct one.
    
    Fixes: bf22ff45bed6 ("genirq: Avoid unnecessary low level irq function calls")
    Reported-by: Martin Peres <martin.peres@linux.intel.com>
    Reported-by: Tomi Sarvela <tomi.p.sarvela@intel.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: jeffy.chen@rock-chips.com
    Cc: Marc Zyngier <marc.zyngier@arm.com>
    Cc: Peter Ziljstra <peterz@infradead.org>
    Cc: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com>
    Tested-by: Tomi Sarvela <tomi.p.sarvela@intel.com>


Applied to core-for-CI and presumed going upstream urgently.
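
The failure described in the commit message can be illustrated with a toy model. This is only a sketch, not kernel code: the class `ToyIrq` and both helper functions are hypothetical. It assumes that disabling the interrupt is pure bookkeeping (no hardware mask is written) and that the hardware comes back masked after suspend while the software state is unchanged. Neither assumption is spelled out in the commit message.

```python
class ToyIrq:
    """Toy model of one interrupt line; not the kernel's data structures."""

    def __init__(self, track_state):
        # True models the irq core after bf22ff45bed6 (masked state is tracked).
        self.track_state = track_state
        self.sw_masked = False    # what the irq core believes
        self.hw_masked = False    # what the modelled HPET hardware holds
        self.sw_disabled = False

    def unmask(self):
        # With state tracking, the hardware write is skipped when the
        # core already believes the line is unmasked.
        if self.track_state and not self.sw_masked:
            return
        self.hw_masked = False
        self.sw_masked = False

    def disable(self):
        # Assumption: disabling only updates bookkeeping, no hardware mask.
        self.sw_disabled = True

    def enable(self):
        self.sw_disabled = False
        self.unmask()

    def suspend_resume(self):
        # Assumption: hardware state is lost and comes back masked,
        # while the software state is left untouched.
        self.hw_masked = True


def resume_via_disable_enable(irq):
    """Old resume path: rely on a disable/enable pair to trigger unmask."""
    irq.suspend_resume()
    irq.disable()
    irq.enable()
    return irq.hw_masked


def resume_via_explicit_unmask(irq):
    """Fixed resume path: write the hardware unmask directly."""
    irq.suspend_resume()
    irq.hw_masked = False
    return irq.hw_masked


# Each call reports whether the hardware is still masked after resume.
print("no tracking, disable/enable:", resume_via_disable_enable(ToyIrq(False)))
print("tracking, disable/enable:   ", resume_via_disable_enable(ToyIrq(True)))
print("tracking, explicit unmask:  ", resume_via_explicit_unmask(ToyIrq(True)))
```

In this model only the middle case leaves the line masked, which corresponds to the boot CPU waiting forever for a timer interrupt. The explicit unmask does not depend on what the core believes about the masked state.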

Comment 13 Martin Peres 2017-08-01 09:54:41 UTC

Thanks Chris. I was a little asleep at the wheel this morning it would seem :)
