105964 – The bit2bit check failed on drm-tip for stream mbcode/mbstat/mvout AVC FEI encode


Status:	CLOSED INVALID

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	Other Linux (All)

Importance:	medium major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2018-04-10 09:22 UTC by Owen Zhang
Modified:	2018-05-03 06:10 UTC
CC List:	3 users

See Also:
i915 platform:	BDW
i915 features:	GEM/Other

Attachments
reproducer_package (976 bytes, application/x-shellscript) 2018-04-10 09:22 UTC, Owen Zhang
sample_fei binary (814.68 KB, application/x-sharedlib) 2018-04-10 09:23 UTC, Owen Zhang
input stream (288.98 MB, application/x-zip-compressed) 2018-04-10 10:51 UTC, Owen Zhang

Description Owen Zhang 2018-04-10 09:22:38 UTC

Created attachment 138721
reproducer_package

When running AVC FEI encode workloads, we find that the output data (mbcode, mbstat, and mvout) differ from run to run for the same input stream.

Expected result: mbcode, mbstat, and mvout are identical for every run.
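
A bit-to-bit check of this kind can be sketched as follows. This is only an illustration, not the attached repr.sh: the directory layout, the file names passed as `names`, and the helper functions `sha256_of()` and `bit2bit_check()` are assumptions made for the example.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large dumps do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def bit2bit_check(run_a: Path, run_b: Path, names=("mbcode", "mbstat", "mvout")):
    """Return the names of the output files that differ between two runs.

    run_a and run_b are directories (hypothetical layout) that each hold
    one dump per name produced by encoding the same input stream.
    """
    return [n for n in names if sha256_of(run_a / n) != sha256_of(run_b / n)]
```

An empty list means the two runs are bit-identical; any returned name identifies an output that changed between runs.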

This issue only reproduces on Broadwell hardware; it has not been seen on SKL.
* Video device: Iris Graphics 6100 (0x162b)

-----------------
This issue is a regression; we bisected it to the following patch:
https://patchwork.freedesktop.org/patch/162046/

[CI,04/10] drm/i915: Eliminate lots of iterations over the execobjects array
The major scaling bottleneck in execbuffer is the processing of the
execobjects. Creating an auxiliary list is inefficient when compared to
using the execobject array we already have allocated.

Reservation is then split into phases. As we lookup up the VMA, we
try and bind it back into active location. Only if that fails, do we add
it to the unbound list for phase 2. In phase 2, we try and add all those
objects that could not fit into their previous location, with fallback
to retrying all objects and evicting the VM in case of severe
fragmentation. (This is the same as before, except that phase 1 is now
done inline with looking up the VMA to avoid an iteration over the
execobject array. In the ideal case, we eliminate the separate reservation
phase). During the reservation phase, we only evict from the VM between
passes (rather than currently as we try to fit every new VMA). In
testing with Unreal Engine's Atlantis demo which stresses the eviction
logic on gen7 class hardware, this speed up the framerate by a factor of
2.

The second loop amalgamation is between move_to_gpu and move_to_active.
As we always submit the request, even if incomplete, we can use the
current request to track active VMA as we perform the flushes and
synchronisation required.

The next big advancement is to avoid copying back to the user any
execobjects and relocations that are not changed.

v2: Add a Theory of Operation spiel.
v3: Fall back to slow relocations in preparation for flushing userptrs.
v4: Document struct members, factor out eb_validate_vma(), add a few
more comments to explain some magic and hide other magic behind macros.

Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
---
 drivers/gpu/drm/i915/i915_drv.h                 |    2 +-
 drivers/gpu/drm/i915/i915_gem_evict.c           |   92 +-
 drivers/gpu/drm/i915/i915_gem_execbuffer.c      | 2042 +++++++++++++----------
 drivers/gpu/drm/i915/i915_vma.c                 |    2 +-
 drivers/gpu/drm/i915/i915_vma.h                 |    1 +
 drivers/gpu/drm/i915/selftests/i915_gem_evict.c |    4 +-
 drivers/gpu/drm/i915/selftests/i915_vma.c       |   16 +-
 7 files changed, 1241 insertions(+), 918 deletions(-)

-----------------
This issue can be reproduced on drm-tip.
The last commit is:
commit 617cdf0bd4fd2cb0dcc64ddf07fbb56572ba800a
Author: Eric Anholt <eric@anholt.net>
Date:   Mon Apr 9 12:59:13 2018 -0700

    drm-tip: 2018y-04m-09d-19h-55m-54s UTC integration manifest


Steps to reproduce:
1) Build this stack: https://software.intel.com/en-us/articles/build-and-debug-open-source-media-stack
2) Run ./repr.sh
The result will show "FAILED".

Comment 1 Owen Zhang 2018-04-10 09:23:37 UTC

Created attachment 138722
sample_fei binary

Comment 2 Martin Peres 2018-04-10 10:09:23 UTC

Thanks for your detailed bug report. I assigned the bug to Chris since it has been bisected to his patch.

Comment 3 Chris Wilson 2018-04-10 10:11:34 UTC

Check your userspace code very carefully, for it is buggy ;)

Comment 4 Chris Wilson 2018-04-10 10:13:01 UTC

The usual explanation here is either a missing write hazard, an invalid relocation, or reusing a stale relocation value before updating it later in the batch.
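
The stale-relocation case can be illustrated with a toy Python model. It assumes the kernel leaves a batch slot untouched whenever the entry's presumed_offset already matches the object's real address. The `Reloc` class, `apply_relocs()`, `demo()`, the buffer name, and the addresses are all hypothetical and exist only for this sketch; none of it is taken from the reporter's code or from the driver source.

```python
from dataclasses import dataclass


@dataclass
class Reloc:
    slot: int             # index of the batch entry that holds the address
    target: str           # name of the buffer object being referenced
    delta: int            # byte offset added to the target's address
    presumed_offset: int  # address userspace claims is already in the batch


def apply_relocs(batch, relocs, placement):
    """Toy kernel pass: patch a slot only when the presumed offset is wrong."""
    patched = []
    for r in relocs:
        actual = placement[r.target]
        if r.presumed_offset == actual:
            continue  # trust userspace: the slot is assumed to be correct
        batch[r.slot] = actual + r.delta
        r.presumed_offset = actual
        patched.append(r.slot)
    return patched


def demo(actual_address):
    """Userspace claims 0x10000 is in the batch, but slot 0 still holds 0."""
    batch = [0x0]
    relocs = [Reloc(slot=0, target="mvout", delta=0, presumed_offset=0x10000)]
    patched = apply_relocs(batch, relocs, {"mvout": actual_address})
    return batch[0], patched
```

In this model, `demo(0x10000)` leaves the slot pointing at address 0 because nothing is patched, while `demo(0x20000)` repairs it because the object moved. The result therefore depends on where buffers happen to be placed, which is one way identical inputs could give different outputs from run to run.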

Comment 5 Owen Zhang 2018-04-10 10:51:37 UTC

Created attachment 138724
input stream

Comment 6 Owen Zhang 2018-04-10 13:29:53 UTC

Thanks for your reply. Is there any difference between BDW and SKL here? We can't reproduce this issue on SKL using the same userspace libraries.
(In reply to Chris Wilson from comment #4)
> The usual explanation here is either a missing write hazard, an invalid
> relocation, or reusing a stale relocation value before updating it later in
> the batch.

Comment 7 Jani Saarinen 2018-04-25 11:54:33 UTC

Chris, is this a valid bug?

Comment 8 Chris Wilson 2018-04-29 20:47:55 UTC

In your code you make the assumption that offset 0 is empty, either by leaving an address pointing to 0 or by omitting a relocation/patch. If it is the same bug as last time, it is because you are trying to use a relative offset of 0 before you specify the base addresses.

Comment 9 Jani Saarinen 2018-04-30 07:52:35 UTC

Closing; please re-open if it occurs again.

Comment 10 Owen Zhang 2018-05-02 03:18:13 UTC

Hi,

I'm sorry, but I need to re-open this issue. I checked the userspace code, and we do not specify offset 0. I did run the following experiment, changing this line in libdrm:
bo_gem->relocs[bo_gem->reloc_count].presumed_offset = -1;

With this change, I think all the offsets from the KMD are invalidated and every BO has to be relocated in the KMD. Is this understanding right? Thanks a lot.
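
The effect of that experiment can be sketched in Python under two assumptions: presumed_offset is an unsigned 64-bit field, so -1 is stored as all ones, and an entry is rewritten only when its presumed offset differs from the object's real address. The helpers `as_u64()` and `relocs_to_patch()` and the sample addresses are hypothetical, made up for this illustration.

```python
U64_MASK = (1 << 64) - 1


def as_u64(value: int) -> int:
    """Mimic C's conversion of a signed value into an unsigned 64-bit field."""
    return value & U64_MASK


def relocs_to_patch(presumed_offsets, actual_offsets):
    """Toy rule: an entry is rewritten only when presumed != actual."""
    return [
        i
        for i, (p, a) in enumerate(zip(presumed_offsets, actual_offsets))
        if as_u64(p) != as_u64(a)
    ]


actual = [0x0, 0x10000]
unchanged = relocs_to_patch([0x0, 0x10000], actual)  # presumed values match
forced = relocs_to_patch([-1, -1], actual)           # -1 never matches
```

Under these assumptions, `unchanged` is empty and `forced` lists every entry, which matches the understanding stated above: with -1 no presumed offset can be valid, so every relocation is processed. If the outputs become stable with this change, that would be consistent with a batch slot holding a stale or missing address rather than with a kernel regression.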

Comment 11 Jani Saarinen 2018-05-02 07:06:56 UTC

Jani, Chris, any help here?

Comment 12 Jani Saarinen 2018-05-03 05:48:27 UTC

Owen, what was the reason to resolve this now?

Comment 13 Owen Zhang 2018-05-03 06:04:55 UTC

In fact, we haven't solved this issue yet, but as Chris mentioned, userspace needs to deal with the BO that is relocated to address 0. We are checking the UMD code again, so I set the status to closed. Thanks.

(In reply to Jani Saarinen from comment #12)
> Owen, what was the reason to resolve this now?

Comment 14 Jani Saarinen 2018-05-03 06:10:52 UTC

OK, thanks.
