Bug 111561 - [CI][DRMTIP] igt@perf_pmu@cpu-hotplug - timeout/dmesg-fail - WARNING: Failed to online cpu
Status: RESOLVED NOTOURBUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: not set not set
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 111545 111546 111547 111548 111550 111573
Depends on:
Blocks:
 
Reported: 2019-09-05 07:33 UTC by Lakshmi
Modified: 2019-09-17 13:13 UTC

See Also:
i915 platform: BDW, BSW/CHT, BXT, BYT, G33, GLK, I965G, PNV, SKL, SNB
i915 features: Perf/PMU


Description Lakshmi 2019-09-05 07:33:56 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_360/fi-bdw-5557u/igt@perf_pmu@cpu-hotplug.html
Starting subtest: cpu-hotplug
(perf_pmu:1228) WARNING: Failed to online cpu0! (5)
(perf_pmu:1228) igt_core-WARNING: FATAL ERROR
Received signal SIGQUIT.
Stack trace: 
 #0 [fatal_sig_handler+0xd6]
 #1 [killpg+0x40]
 #2 [pause+0x11]
 #3 [igt_fatal_error+0x55]
 #4 [__real_main1674+0x2acb]
 #5 [main+0x27]
 #6 [__libc_start_main+0xe7]
 #7 [_start+0x2a]
Received signal SIGQUIT.
Stack trace: 
 #0 [fatal_sig_handler+0xd6]
 #1 [killpg+0x40]
 #2 [nanosleep+0x14]
 #3 [usleep+0x57]
 #4 [__real_main1674+0xe98]
 #5 [main+0x27]
 #6 [__libc_start_main+0xe7]
 #7 [_start+0x2a]
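
For context, a minimal sketch of the operation that fails above. It assumes the subtest toggles CPUs through the standard Linux sysfs `online` files and that the `(5)` in the warning is an errno value; neither is confirmed in this report. The helper name `set_cpu_online` and its `sysfs_root` parameter are hypothetical, added so the sketch can be tried against a scratch directory instead of a live system.

```python
import os


def set_cpu_online(cpu, online, sysfs_root="/sys/devices/system/cpu"):
    """Write "1" or "0" to <sysfs_root>/cpu<N>/online.

    Returns 0 on success, or the errno of the failed open/write.
    Hypothetical helper for illustration only; not taken from the test.
    """
    path = os.path.join(sysfs_root, f"cpu{cpu}", "online")
    try:
        # Leaving the with-block flushes and closes the file, so a write
        # error reported late by the kernel is still caught here.
        with open(path, "w") as f:
            f.write("1" if online else "0")
    except OSError as e:
        return e.errno
    return 0
```

On a real machine this needs root, and a non-zero return for the online step would correspond to the kind of failure reported here.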
Comment 2 CI Bug Log 2019-09-06 13:20:17 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BLB BWR SNB BDW BYT BSW BXT SLK GLK: igt@perf_pmu@cpu-hotplug - timeout - WARNING: Failed to online cpu -}
{+ BLB BWR SNB BDW BYT BSW BXT SLK GLK: igt@perf_pmu@cpu-hotplug - timeout/dmesg-fail- WARNING: Failed to online cpu +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_361/fi-pnv-d510/igt@perf_pmu@cpu-hotplug.html
Comment 3 CI Bug Log 2019-09-06 15:01:52 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BLB BWR SNB BDW BYT BSW BXT SLK GLK: igt@perf_pmu@cpu-hotplug - timeout/dmesg-fail- WARNING: Failed to online cpu -}
{+ BLB BWR SNB BDW BYT BSW BXT SLK GLK: igt@perf_pmu@cpu-hotplug - timeout/dmesg-fail- WARNING: Failed to online cpu +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_361/fi-elk-e7500/igt@perf_pmu@cpu-hotplug.html
Comment 4 Chris Wilson 2019-09-06 18:01:59 UTC
1e19ec6c3c417a0893fcfae7abfba623e781d876 is the first bad commit
ickle@kabylake:~/linux$ git show 1e19ec6c3c417a0893fcfae7abfba623e781d876
commit 1e19ec6c3c417a0893fcfae7abfba623e781d876 (HEAD, refs/bisect/bad)
Merge: 7610bb0bde4c 424c38a4e325
Author: Dave Airlie <airlied@redhat.com>
Date:   Fri Sep 6 16:25:45 2019 +1000

    Merge tag 'drm-misc-fixes-2019-09-05' of git://anongit.freedesktop.org/drm/drm-misc into drm-fixes
    
    drm-misc-fixes for v5.3 final:
    - Make ingenic panel type DPI insteado f unknown.
    - Fixes for command line parser modes.
    
    Signed-off-by: Dave Airlie <airlied@redhat.com>
    
    From: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/606d87b2-1840-c893-eb30-d6c471c9e50a@linux.intel.com

* cries
Comment 5 Chris Wilson 2019-09-06 19:12:48 UTC
Second attempt is more promising,

7af0145067bc429a09ac4047b167c0971c9f0dc7 is the first bad commit
commit 7af0145067bc429a09ac4047b167c0971c9f0dc7
Author: Thomas Gleixner <tglx@linutronix.de>
Date:   Thu Aug 29 00:31:34 2019 +0200

    x86/mm/cpa: Prevent large page split when ftrace flips RW on kernel text
    
    ftrace does not use text_poke() for enabling trace functionality. It uses
    its own mechanism and flips the whole kernel text to RW and back to RO.
    
    The CPA rework removed a loop based check of 4k pages which tried to
    preserve a large page by checking each 4k page whether the change would
    actually cover all pages in the large page.
    
    This resulted in endless loops for nothing as in testing it turned out that
    it actually never preserved anything. Of course testing missed to include
    ftrace, which is the one and only case which benefitted from the 4k loop.
    
    As a consequence enabling function tracing or ftrace based kprobes results
    in a full 4k split of the kernel text, which affects iTLB performance.
    
    The kernel RO protection is the only valid case where this can actually
    preserve large pages.
    
    All other static protections (RO data, data NX, PCI, BIOS) are truly
    static.  So a conflict with those protections which results in a split
    should only ever happen when a change of memory next to a protected region
    is attempted. But these conflicts are rightfully splitting the large page
    to preserve the protected regions. In fact a change to the protected
    regions itself is a bug and is warned about.
    
    Add an exception for the static protection check for kernel text RO when
    the to be changed region spawns a full large page which allows to preserve
    the large mappings. This also prevents the syslog to be spammed about CPA
    violations when ftrace is used.
    
    The exception needs to be removed once ftrace switched over to text_poke()
    which avoids the whole issue.
    
    Fixes: 585948f4f695 ("x86/mm/cpa: Avoid the 4k pages check completely")
    Reported-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Tested-by: Song Liu <songliubraving@fb.com>
    Reviewed-by: Song Liu <songliubraving@fb.com>
    Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1908282355340.1938@nanos.tec.linutronix.de

:040000 040000 7573fe3d4b07246d455fa84e2b9a988e5d2e07de 03d2c7f17892fe1e3348ab484a9a6f5f64d69f44 M	arch
Comment 6 Chris Wilson 2019-09-06 20:27:48 UTC
And that appears to be a dud as well. Looks like my testcase isn't as reliable as I thought.
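
One common way to bisect an intermittent failure is to repeat the test several times per step, so that a single lucky pass does not mark a bad commit as good. The following is a rough sketch of a wrapper for `git bisect run`; the names `classify` and `main`, the run count, and the timeout are illustrative assumptions, not what was used in this investigation. `git bisect run` treats exit status 0 as good, 1 to 127 (except 125) as bad, and anything above 127 as a request to abort.

```python
import subprocess
import sys


def classify(cmd, runs=10, timeout=300):
    """Return 1 (bad) on the first failing or hanging run, else 0 (good)."""
    for _ in range(runs):
        try:
            result = subprocess.run(cmd, timeout=timeout)
        except subprocess.TimeoutExpired:
            # A hang counts as a failure, matching the timeout symptom.
            return 1
        if result.returncode != 0:
            return 1
    return 0


def main(argv):
    """Entry point: argv[1:] is the test command to repeat."""
    if len(argv) < 2:
        print("usage: repeat_test.py <test command...>", file=sys.stderr)
        return 128  # above 127, so git bisect run aborts instead of judging
    return classify(argv[1:])
```

To use it as a script, end the file with `sys.exit(main(sys.argv))` and pass it to `git bisect run` followed by the test command. More repetitions lower the chance of a false "good" but cannot eliminate it.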
Comment 7 Chris Wilson 2019-09-07 09:36:24 UTC
Third attempt gave,

558682b5291937a70748d36fd9ba757fb25b99ae is the first bad commit
commit 558682b5291937a70748d36fd9ba757fb25b99ae
Author: Bandan Das <bsd@redhat.com>
Date:   Mon Aug 26 06:15:13 2019 -0400

    x86/apic: Include the LDR when clearing out APIC registers
    
    Although APIC initialization will typically clear out the LDR before
    setting it, the APIC cleanup code should reset the LDR.
    
    This was discovered with a 32-bit KVM guest jumping into a kdump
    kernel. The stale bits in the LDR triggered a bug in the KVM APIC
    implementation which caused the destination mapping for VCPUs to be
    corrupted.
    
    Note that this isn't intended to paper over the KVM APIC bug. The kernel
    has to clear the LDR when resetting the APIC registers except when X2APIC
    is enabled.
    
    This lacks a Fixes tag because missing to clear LDR goes way back into pre
    git history.
    
    [ tglx: Made x2apic_enabled a function call as required ]
    
    Signed-off-by: Bandan Das <bsd@redhat.com>
    Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190826101513.5080-3-bsd@redhat.com

:040000 040000 f523ff2c03fab6308dad45a102e0fd3313a00475 76d836afd63b7f540a124e763456c233d07c7061 M	arch
Comment 8 Chris Wilson 2019-09-07 10:10:49 UTC
commit e7fe60181cc827a4901a938149376f290372e2e7 (drm-intel/topic/core-for-CI, topic/core-for-CI)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Sep 7 10:41:00 2019 +0100

    Revert "x86/apic: Include the LDR when clearing out APIC registers"
    
    This reverts commit 558682b5291937a70748d36fd9ba757fb25b99ae.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111561
Comment 9 Chris Wilson 2019-09-07 11:32:10 UTC
*** Bug 111546 has been marked as a duplicate of this bug. ***
Comment 10 Chris Wilson 2019-09-07 13:16:59 UTC
*** Bug 111573 has been marked as a duplicate of this bug. ***
Comment 11 Chris Wilson 2019-09-07 13:17:05 UTC
*** Bug 111550 has been marked as a duplicate of this bug. ***
Comment 12 Chris Wilson 2019-09-07 13:17:12 UTC
*** Bug 111548 has been marked as a duplicate of this bug. ***
Comment 13 Chris Wilson 2019-09-07 13:17:19 UTC
*** Bug 111547 has been marked as a duplicate of this bug. ***
Comment 14 Chris Wilson 2019-09-07 13:17:26 UTC
*** Bug 111545 has been marked as a duplicate of this bug. ***
Comment 16 CI Bug Log 2019-09-17 13:13:06 UTC
A CI Bug Log filter associated to this bug has been updated:

{- BLB BWR SNB BDW BYT BSW BXT SLK GLK: igt@perf_pmu@cpu-hotplug - timeout/dmesg-fail- WARNING: Failed to online cpu -}
{+ All machines: igt@perf_pmu@cpu-hotplug - timeout/dmesg-fail- WARNING: Failed to online cpu +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_363/fi-tgl-u/igt@perf_pmu@cpu-hotplug.html

