Bug 96584 - [regression] [i915] DMAR Errors Spamming Logs
Summary: [regression] [i915] DMAR Errors Spamming Logs
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium minor
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-06-19 00:56 UTC by Mike Lothian
Modified: 2017-07-24 22:41 UTC
CC List: 2 users

See Also:
i915 platform: SKL
i915 features: GEM/Other


Attachments
dmesg (76.52 KB, text/plain)
2016-06-19 01:19 UTC, Mike Lothian
More skylake w/a (1.32 KB, patch)
2016-06-23 06:27 UTC, Chris Wilson
Chris' patch rebased vs drm-intel-nightly head (1.49 KB, patch)
2016-07-01 15:06 UTC, yann

Description Mike Lothian 2016-06-19 00:56:00 UTC
Hi

My dmesg is now filled with:

DMAR: DRHD: handling fault status reg 3
DMAR: DMAR:[DMA Read] Request device [00:02.0] fault addr fbff0000 
DMAR:[fault reason 06] PTE Read access is not set

It seems to happen only before X starts and during shutdown.
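
To gauge how often the fault fires and whether it always hits the same address, the log can be tallied with a short script. This is a minimal sketch, not part of the original report: the regular expression is modelled only on the three lines quoted above (other kernel versions may format DMAR messages differently), and `count_dmar_faults` is a hypothetical helper name.

```python
import re
from collections import Counter

# Pattern modelled on the fault line quoted in this report (assumption:
# other kernels may print DMAR faults in a different layout).
FAULT_RE = re.compile(
    r"DMAR:\s*\[(?P<kind>DMA (?:Read|Write))\] "
    r"Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>[0-9a-f]+)"
)


def count_dmar_faults(lines):
    """Count faults per (device, access kind, fault address)."""
    counts = Counter()
    for line in lines:
        match = FAULT_RE.search(line)
        if match:
            key = (match["dev"], match["kind"], int(match["addr"], 16))
            counts[key] += 1
    return counts


# The three lines from the report; only the middle one carries the address.
sample = [
    "DMAR: DRHD: handling fault status reg 3",
    "DMAR: DMAR:[DMA Read] Request device [00:02.0] fault addr fbff0000 ",
    "DMAR:[fault reason 06] PTE Read access is not set",
]

for (dev, kind, addr), n in count_dmar_faults(sample).items():
    print(f"{dev} {kind} 0x{addr:x}: {n}")
```

In practice the input would be the lines of a saved dmesg dump instead of `sample`.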

I've bisected it down to:

975f7ff42edfbad53d65ad63a4f3e7ada8c7538f is the first bad commit
commit 975f7ff42edfbad53d65ad63a4f3e7ada8c7538f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat May 14 07:26:34 2016 +0100

    drm/i915: Lazily migrate the objects after hibernation
    
    Now that we mark the object domains for having been restored from the
    hibernation image, we not need to flush everything during resume and
    can instead rely on the normal domain tracking to flush only when
    required. The only caveat here are objects that are pinned for use by
    the hardware, whose contents must be coherent for when the device
    resumes reading from then (shortly afterwards with the driver assuming
    the objects are in the correct domain).
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=94722
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Imre Deak <imre.deak@intel.com>
    Cc: David Weinehall <david.weinehall@intel.com>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Tested-by: David Weinehall <david.weinehall@intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/1463207195-22076-3-git-send-email-chris@chris-wilson.co.uk

:040000 040000 336a603f6bd03d205632a4e131f771638c8b65b0 91bebf7c1f376c76ef92e6e1a7ddc14d674378ee M      drivers
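
For readers unfamiliar with the procedure, bisection is a binary search over the commit history between a known-good and a known-bad revision. The following is a minimal sketch of that idea only, not how git implements it; `first_bad`, the commit names, and the `is_bad` callback are hypothetical.

```python
def first_bad(commits, is_bad):
    """Return the first bad commit.

    `commits` is ordered oldest to newest; commits[0] is assumed known good
    and commits[-1] known bad. `is_bad` stands in for building and testing
    a kernel at that commit.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return commits[bad]


# Hypothetical history in which the regression lands in "c5".
history = ["c0", "c1", "c2", "c3", "c4", "c5", "c6", "c7"]
print(first_bad(history, lambda c: int(c[1:]) >= 5))
```

A single wrong good/bad verdict sends the search into the wrong half, which is why confirming the result by reverting the reported commit is a useful check.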
Comment 1 Mike Lothian 2016-06-19 01:11:45 UTC
I'm starting to think I've made a mistake bisecting, as reverting that commit doesn't seem to fix things for me.
Comment 2 Mike Lothian 2016-06-19 01:19:36 UTC
Created attachment 124596 [details]
dmesg
Comment 3 Mike Lothian 2016-06-19 01:22:26 UTC
Sorry, I made a mistake in the last part of the bisect.

The first broken commit is:

commit f7770bfd9fd2ef13a5b70de1ffbc16019a929b48
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat May 14 07:26:35 2016 +0100

    drm/i915: Skip clearing the GGTT on full-ppgtt systems
    
    Under full-ppgtt, access to the global GTT is carefully regulated
    through hardware functions (i.e. userspace cannot read and write to
    arbitrary locations in the GGTT via the GPU). With this restriction in
    place, we can forgo clearing stale entries from the GGTT as they will
    not be accessed.
    
    For aliasing-ppgtt, we could almost do the same except that we do allow
    userspace access to the global-GTT via execbuf in order to workraound
    some quirks of certain instructions. (This execbuf path is filtered out
    with EINVAL on full-ppgtt.)
    
    The most dramatic effect this will have will be during resume, as with
    full-ppgtt the GGTT is only used sparingly.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=94722
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: David Weinehall <david.weinehall@intel.com>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Tested-by: David Weinehall <david.weinehall@intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/1463207195-22076-4-git-send-email-chris@chris-wilson.co.uk
Comment 4 Chris Wilson 2016-06-19 06:15:42 UTC
That looks more like a tell-tale sign that we forgot to rebind an object (or that we are now catching an out-of-bounds access). What hardware is this? Skylake?
Comment 5 Mike Lothian 2016-06-19 13:58:16 UTC
Yes, it's Skylake:

00:02.0 VGA compatible controller [0300]: Intel Corporation HD Graphics 530 [8086:191b] (rev 06)
Comment 6 Mike Lothian 2016-06-19 13:58:46 UTC
Also, reverting that commit makes the errors go away.
Comment 7 Chris Wilson 2016-06-20 09:35:07 UTC
Hmm, there are also lots of caveats to mixing DMAR and Skylake igfx that we need to investigate to see which require driver workarounds.
Comment 9 Chris Wilson 2016-06-20 09:45:34 UTC
In particular, I think this matches SKL036 ("Processor Graphics IOMMU Unit May Report Spurious Faults"), since we are not marking the PTE as present at all. If we don't see similar failures across platforms (e.g. Broadwell with execlists/full-ppgtt), then it is safe to assume this is a Skylake problem.
Comment 10 Mike Lothian 2016-06-23 02:00:41 UTC
Is there anything else you'd like me to do?
Comment 11 Chris Wilson 2016-06-23 06:27:21 UTC
Created attachment 124675 [details] [review]
More skylake w/a

This will be the patch, possibly covering more or fewer models.
Comment 12 Mike Lothian 2016-06-30 21:25:47 UTC
What kernel is that patch supposed to apply against?
Comment 13 yann 2016-07-01 15:06:20 UTC
Created attachment 124839 [details] [review]
Chris' patch rebased vs drm-intel-nightly head
Comment 14 yann 2016-07-01 15:08:12 UTC
Mike, you can apply this patch on top of the latest drm-intel-nightly kernel.
Comment 15 Mike Lothian 2016-07-01 15:10:05 UTC
Do you have the URL to clone for that? I currently build Linus's tree, Dave's drm-next and Alex's drm-next-4.8-wip branch.
Comment 16 yann 2016-07-01 15:14:53 UTC
https://cgit.freedesktop.org/drm-intel/

You can then clone it:

    git clone git://anongit.freedesktop.org/drm-intel

and then, for instance, check out the drm-intel-nightly branch.
Comment 17 Mike Lothian 2016-07-02 09:44:11 UTC
That does indeed fix the issue.

Thanks
Comment 18 yann 2016-07-05 10:06:15 UTC
Patch sent for review: https://patchwork.freedesktop.org/patch/95067/
Comment 19 Chris Wilson 2016-07-08 12:39:08 UTC
commit 48f112fed3b07858f1b3a78548d23320fb96747b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jun 24 14:07:14 2016 +0100

    drm/i915: Fill unused GGTT with scratch pages for VT-d
    
    One of the numerous VT-d workarounds we require is that the display
    hardware reads past the end of the buffer triggering VT-d faults. This
    is acknowledged in the code as being safe "since we fill the unused
    portions of the GGTT with the scratch page". Alas, that is no longer
    always true and so we trigger DMAR read faults.
    
    Skylake also requires another workaround to avoid mixing VT-d and
    unpopulated PTE, and so there we also need to ensure we fill unused
    entries with the scratch page.
    
    Reported-by: Mike Lothian <mike@fireburn.co.uk>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=96584
    Fixes: f7770bfd9fd2 ("drm/i915: Skip clearing the GGTT on full-ppgtt systems")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: David Weinehall <david.weinehall@intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/1466773634-8106-1-git-send-email-chris@chris-wilson.co.uk
    Reviewed-by: David Weinehall <david.weinehall@intel.com>
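
The effect described in the commit message can be illustrated with a toy model. This is a sketch under stated assumptions, not the i915 code: `ToyGGTT`, `scan_out`, and the one-entry overfetch are hypothetical stand-ins for the global GTT, the display engine's scan-out, and its read past the end of the buffer.

```python
SCRATCH = "scratch"


class ToyGGTT:
    """Toy page table: each entry is a page name, SCRATCH, or None (not present)."""

    def __init__(self, num_entries, fill_unused_with_scratch):
        self.fill = fill_unused_with_scratch
        self.ptes = [self._unused() for _ in range(num_entries)]

    def _unused(self):
        # With the workaround, unused entries point at a harmless scratch page.
        return SCRATCH if self.fill else None

    def bind(self, start, pages):
        for offset, page in enumerate(pages):
            self.ptes[start + offset] = page

    def unbind(self, start, count):
        for index in range(start, start + count):
            self.ptes[index] = self._unused()

    def device_read(self, index):
        pte = self.ptes[index]
        if pte is None:
            # Stands in for the DMAR "PTE Read access is not set" fault.
            raise LookupError(f"fault: PTE {index} not present")
        return pte


def scan_out(ggtt, start, count, overfetch=1):
    """Read a buffer plus `overfetch` entries past its end, like the display does."""
    return [ggtt.device_read(i) for i in range(start, start + count + overfetch)]


framebuffer = ["fb0", "fb1", "fb2", "fb3"]
for fill in (True, False):
    ggtt = ToyGGTT(8, fill_unused_with_scratch=fill)
    ggtt.bind(0, framebuffer)
    try:
        print(fill, scan_out(ggtt, 0, len(framebuffer)))
    except LookupError as err:
        print(fill, err)
```

With scratch filling enabled, the overfetch lands on the scratch page; without it, the same read hits a non-present entry and faults, which corresponds to the behaviour reported in this bug.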

