Bug 102296

Summary: R9 285 VCE corruption since drm/amdgpu/gmc8: use the vram location programmed by the vbios
Product: DRI
Reporter: Andy Furniss <adf.lists>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: CLOSED FIXED
QA Contact:
Severity: normal
Priority: medium
CC: ckoenig.leichtzumerken
Version: DRI git
Hardware: Other
OS: All

Attachments:

Description	Flags
dmesg on bad	none
dmesg on good (commit before bad dmesg)	none
possible fix	none
dmesg with patch	none
possible fix v2	none
dmesg with v2 patch	none

Description Andy Furniss 2017-08-18 14:43:10 UTC

A bit late with this; bisected on drm-next-4.13-wip.

On an R9 285 (Tonga) I am getting corrupted output from VCE encode, via either omx or vaapi, since the commit below.

To reproduce this you need to use gstreamer and encode "fast and large", e.g. 2160p from raw nv12.

Slow things like ffmpeg, or gst-vaapi without ! queue !, seem to hide the issue somewhat.

26d4ac55d2260f8685475b3f6e76e276a238cca7 is the first bad commit
commit 26d4ac55d2260f8685475b3f6e76e276a238cca7
Author: Alex Deucher <alexander.deucher@amd.com>
Date:   Tue Nov 1 13:08:33 2016 -0400

    drm/amdgpu/gmc8: use the vram location programmed by the vbios
    
    This makes mc programming much simpler in future patches.
    
    Since evergreen, the vbios has been programming the fb location
    to the proper vram size.  The only reason to reprogram it would
    be to change the location.

Comment 1 Alex Deucher 2017-08-18 15:06:50 UTC

Please attach your dmesg output.  Are there any error messages in the output?

Comment 2 Andy Furniss 2017-08-18 15:31:21 UTC

A diff of the (trimmed) dmesg-good and dmesg-bad shows, amongst other things:

< amdgpu 0000:01:00.0: VRAM: 2048M 0x0000000000000000 - 0x000000007FFFFFFF (2048M used)
< amdgpu 0000:01:00.0: GTT: 3072M 0x0000000080000000 - 0x000000013FFFFFFF
---
> amdgpu 0000:01:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
> amdgpu 0000:01:00.0: GTT: 3072M 0x0000000000000000 - 0x00000000BFFFFFFF
826c826
< [drm] PCIE GART of 3072M enabled (table at 0x0000000000040000).
---
> [drm] PCIE GART of 3072M enabled (table at 0x000000F400040000).
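
The address ranges in the diff above can be checked with a short script. This is only an illustrative sketch: the log lines are copied from the diff, while the regular expression and the helper name `parse_ranges` are made up for this example and are not part of any amdgpu tooling. It confirms that the reported sizes match the ranges and shows that VRAM moved from base 0x0 to 0xF400000000 while GTT moved down to base 0x0.

```python
import re

# Log lines copied verbatim from the good/bad dmesg diff.
GOOD = [
    "amdgpu 0000:01:00.0: VRAM: 2048M 0x0000000000000000 - 0x000000007FFFFFFF (2048M used)",
    "amdgpu 0000:01:00.0: GTT: 3072M 0x0000000080000000 - 0x000000013FFFFFFF",
]
BAD = [
    "amdgpu 0000:01:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)",
    "amdgpu 0000:01:00.0: GTT: 3072M 0x0000000000000000 - 0x00000000BFFFFFFF",
]

# Hypothetical pattern for this example: aperture name, size in MiB, start, end.
RANGE_RE = re.compile(r"(VRAM|GTT): (\d+)M (0x[0-9A-Fa-f]+) - (0x[0-9A-Fa-f]+)")


def parse_ranges(lines):
    """Return {name: (start, end, size_mib)} for every VRAM/GTT line found."""
    ranges = {}
    for line in lines:
        match = RANGE_RE.search(line)
        if match:
            name, size_mib, start, end = match.groups()
            ranges[name] = (int(start, 16), int(end, 16), int(size_mib))
    return ranges


good = parse_ranges(GOOD)
bad = parse_ranges(BAD)

for label, ranges in (("good", good), ("bad", bad)):
    for name, (start, end, size_mib) in ranges.items():
        # The inclusive range should span exactly the reported size.
        assert (end - start + 1) >> 20 == size_mib
        print(f"{label:4} {name:4} base={start:#014x} "
              f"highest address needs {end.bit_length()} bits")
```

On the good kernel every VRAM address fits in 31 bits; on the bad kernel the top of VRAM needs 40 bits.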

Comment 3 Andy Furniss 2017-08-18 15:31:59 UTC

Created attachment 133608 [details]
dmesg on bad

Comment 4 Andy Furniss 2017-08-18 15:33:10 UTC

Created attachment 133609 [details]
dmesg on good (commit before bad dmesg)

Comment 5 Alex Deucher 2017-08-18 15:45:59 UTC

Created attachment 133610 [details] [review]
possible fix

Does this patch help?

Comment 6 Andy Furniss 2017-08-18 16:27:38 UTC

Created attachment 133611 [details]
dmesg with patch

No, the encode fails differently though, throwing lots of

amdgpu: The CS has been cancelled because the context is lost.

and in dmesg

[  103.116736] [drm:amdgpu_vce_cs_reloc [amdgpu]] *ERROR* BO to small for addr 0x010cf1e000 156 155

This actually looks familiar, as current mesa + vaapi has done this since a patch from March.

I am testing this using OMX and have never seen it do that before. The issue I bisected was different: the encoder reported no errors but output a corrupt stream. It was playable and looked good to start with, then degraded as time went on, with the decoder throwing h264 errors.

Comment 7 Alex Deucher 2017-08-18 16:54:14 UTC

Created attachment 133612 [details] [review]
possible fix v2

Whoops, the original patch had a typo in it.  Does this simplified version work any better?

Comment 8 Andy Furniss 2017-08-18 17:25:15 UTC

Created attachment 133613 [details]
dmesg with v2 patch

No luck with v2.
The errors are gone, but the original issue is the same.

Comment 9 Christian König 2017-08-18 17:59:24 UTC

I'm already following this, Alex, but I don't have the slightest idea either.

Andy, could you disable multiple instance support in VCE as a test? (I need to dig through the Mesa source as well, but I think Leo has asked for that multiple times, so you might know how offhand.)

Apart from that, I would say let's dump all the calculated addresses with good and bad and see what is different.

Comment 10 Andy Furniss 2017-08-18 18:24:39 UTC

Disabling dual instance does avoid it.

Comment 11 Andy Furniss 2017-09-19 14:59:20 UTC

This seems to work OK on current drm-next-4.15-wip; I don't know yet whether it's luck or not. Perf is very slightly lower, and I haven't been testing every iteration of new kernels due to testing vce stuff.

There is also a powerplay/display regression on this kernel, unrelated to vce, which I'll try to find later and file a bug for.

Comment 12 Andy Furniss 2017-09-19 15:16:20 UTC

Oops, the issue does still exist.
I pasted the wrong command line, which also explains why it was slightly slower.

Comment 13 Andy Furniss 2017-09-19 15:24:43 UTC

Re-reading this, I notice I didn't paste the full diff between good and bad, so here's a bit more.

In the good/bad diff, other rings do vary a bit in the second field (cpu addr), but ring 12 (VCE?) is the only one that differs in the first field (gpu addr).

< amdgpu 0000:01:00.0: fence driver on ring 12 use gpu addr 0x0000000000821f40, cpu addr 0xffffc9000364ef40
---
> amdgpu 0000:01:00.0: fence driver on ring 12 use gpu addr 0x000000f400821f40, cpu addr 0xffffc9000104ef40
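
To make the ring 12 difference concrete, here is a small sketch comparing the two fence addresses against the VRAM bases reported in the dmesg diff from comment 2. The constant and function names are invented for this example. It only shows that the offset into VRAM is unchanged and that the base is what moved; it does not by itself explain the corruption.

```python
# Fence driver gpu addresses for ring 12, copied from the diff above.
GOOD_FENCE_GPU_ADDR = 0x0000000000821F40
BAD_FENCE_GPU_ADDR = 0x000000F400821F40

# VRAM base addresses from the "VRAM: 2048M ..." lines in comment 2.
GOOD_VRAM_BASE = 0x0000000000000000
BAD_VRAM_BASE = 0x000000F400000000


def offset_in_vram(gpu_addr, vram_base):
    """Offset of a GPU address from the start of the VRAM aperture."""
    return gpu_addr - vram_base


good_offset = offset_in_vram(GOOD_FENCE_GPU_ADDR, GOOD_VRAM_BASE)
bad_offset = offset_in_vram(BAD_FENCE_GPU_ADDR, BAD_VRAM_BASE)

print(f"good: offset {good_offset:#x}, address needs "
      f"{GOOD_FENCE_GPU_ADDR.bit_length()} bits")
print(f"bad:  offset {bad_offset:#x}, address needs "
      f"{BAD_FENCE_GPU_ADDR.bit_length()} bits")
print(f"address shift between kernels: "
      f"{BAD_FENCE_GPU_ADDR - GOOD_FENCE_GPU_ADDR:#x}")
```

The same shift applies to the PCIE GART table address in comment 2 (0x40000 versus 0xF400040000).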

Comment 14 Andy Furniss 2018-01-17 19:21:21 UTC

OK with current 4.17-wip

Comment 15 Christian König 2018-01-18 17:59:40 UTC

Good that we finally found the root cause and thanks for testing.
