Summary: | R9 285 VCE corruption since drm/amdgpu/gmc8: use the vram location programmed by the vbios
---|---
Product: | DRI
Component: | DRM/AMDgpu
Reporter: | Andy Furniss <adf.lists>
Assignee: | Default DRI bug account <dri-devel>
Status: | CLOSED FIXED
Severity: | normal
Priority: | medium
CC: | ckoenig.leichtzumerken
Version: | DRI git
Hardware: | Other
OS: | All
Description
Andy Furniss
2017-08-18 14:43:10 UTC
Please attach your dmesg output. Are there any error messages in the output?

diff of (cut) dmesg-good dmesg-bad shows, amongst other things:

< amdgpu 0000:01:00.0: VRAM: 2048M 0x0000000000000000 - 0x000000007FFFFFFF (2048M used)
< amdgpu 0000:01:00.0: GTT: 3072M 0x0000000080000000 - 0x000000013FFFFFFF
---
> amdgpu 0000:01:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)
> amdgpu 0000:01:00.0: GTT: 3072M 0x0000000000000000 - 0x00000000BFFFFFFF
826c826
< [drm] PCIE GART of 3072M enabled (table at 0x0000000000040000).
---
> [drm] PCIE GART of 3072M enabled (table at 0x000000F400040000).

Created attachment 133608 [details]
dmesg on bad
Created attachment 133609 [details]
dmesg on good (commit before bad dmesg)
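
For anyone wanting to sanity-check the two layouts in the diff above, here is a minimal Python sketch. It only parses the VRAM/GTT lines quoted in this report and confirms that each range spans the advertised size. The regular expression and the helper names `parse_range` and `layout` are made up for illustration and are not part of any amdgpu tooling.

```python
import re

# Lines copied from the good/bad dmesg diff in this report.
GOOD = [
    "amdgpu 0000:01:00.0: VRAM: 2048M 0x0000000000000000 - 0x000000007FFFFFFF (2048M used)",
    "amdgpu 0000:01:00.0: GTT: 3072M 0x0000000080000000 - 0x000000013FFFFFFF",
]
BAD = [
    "amdgpu 0000:01:00.0: VRAM: 2048M 0x000000F400000000 - 0x000000F47FFFFFFF (2048M used)",
    "amdgpu 0000:01:00.0: GTT: 3072M 0x0000000000000000 - 0x00000000BFFFFFFF",
]

# Illustrative pattern (an assumption): domain name, size in MiB, start, end.
RANGE_RE = re.compile(r"(VRAM|GTT): (\d+)M (0x[0-9A-Fa-f]+) - (0x[0-9A-Fa-f]+)")


def parse_range(line):
    """Return (name, size_mib, start, end) for one VRAM/GTT dmesg line."""
    match = RANGE_RE.search(line)
    if match is None:
        raise ValueError(f"not a VRAM/GTT range line: {line!r}")
    name, size_mib, start, end = match.groups()
    return name, int(size_mib), int(start, 16), int(end, 16)


def layout(lines):
    """Map each domain name to its (size_mib, start, end) tuple."""
    return {name: (size, start, end)
            for name, size, start, end in map(parse_range, lines)}


for label, lines in (("good", GOOD), ("bad", BAD)):
    for name, (size_mib, start, end) in layout(lines).items():
        # The end address is inclusive, hence the +1 before converting to MiB.
        span_mib = (end - start + 1) >> 20
        print(f"{label:4} {name:4} start={start:#014x} span={span_mib}M "
              f"(reported {size_mib}M)")
```

The sizes match in both kernels. What changes is only where the ranges sit: VRAM moves from 0 to 0xF400000000 and GTT moves down to 0.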
Created attachment 133610 [details] [review]
possible fix

Does this patch help?

Created attachment 133611 [details]
dmesg with patch
No, the encode fails differently though, throwing lots of
amdgpu: The CS has been cancelled because the context is lost.
and in dmesg
[ 103.116736] [drm:amdgpu_vce_cs_reloc [amdgpu]] *ERROR* BO to small for addr 0x010cf1e000 156 155
This actually looks familiar, as current mesa + vaapi has done this since a patch from March.
I am testing this using OMX and have never seen it do that before. The issue I bisected was a corrupt stream output with no errors from the encoder: it was playable and looked good to start with, but degraded as time went on, with the decoder throwing h264 errors.
Created attachment 133612 [details] [review]
possible fix v2

Whoops, the original patch had a typo in it. Does this simplified version work any better?

Created attachment 133613 [details]
dmesg with v2 patch
No luck with v2.
The errors are gone, but the original issue is the same.
Already following this Alex, but not the slightest idea either.

Andy, could you, as a test, disable multiple instance support in VCE? (I need to dig through the Mesa source as well, but I think Leo asked for that multiple times, so you might know offhand.)

Apart from that I would say let's dump all the calculated addresses with good and bad and see what is different.

Disabling dual instance does avoid it.

This seems to work OK on current drm-next-4.15-wip, don't know if it's luck or not yet. Perf is very slightly lower and I haven't been testing every iteration of new kernels due to testing vce stuff. There is also a powerplay/display regression on this kernel, unrelated to vce, which I'll try to find later and file a bug.

Oops, the issue does still exist. I pasted the wrong command line, which also explains why it was slightly slower.

Re-reading this I notice I didn't paste the full diff between good and bad, so here's a bit more -

diff good bad. Though other rings do vary a bit in the second field, ring 12 (VCE?) is the only one that's different in the first field.
< amdgpu 0000:01:00.0: fence driver on ring 12 use gpu addr 0x0000000000821f40, cpu addr 0xffffc9000364ef40
---
> amdgpu 0000:01:00.0: fence driver on ring 12 use gpu addr 0x000000f400821f40, cpu addr 0xffffc9000104ef40
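
A quick arithmetic check on these two lines: the bad gpu addr differs from the good one by exactly 0xF400000000, which is the VRAM base reported in the bad dmesg, so the ring 12 fence sits at the same offset into VRAM on both kernels. The same holds for the GART table address from the earlier diff. Below is a small Python sketch of that subtraction. All values are copied from this report; the names `ADDRESSES` and `vram_offsets` are only illustrative.

```python
# VRAM start addresses from the good and bad dmesg output in this report.
VRAM_BASE_GOOD = 0x0000000000000000
VRAM_BASE_BAD = 0x000000F400000000

# (good, bad) GPU addresses quoted in the diffs in this report.
ADDRESSES = {
    "ring 12 fence": (0x0000000000821F40, 0x000000F400821F40),
    "GART table": (0x0000000000040000, 0x000000F400040000),
}


def vram_offsets(good_addr, bad_addr):
    """Return each address as an offset from its kernel's VRAM base."""
    return good_addr - VRAM_BASE_GOOD, bad_addr - VRAM_BASE_BAD


for name, (good, bad) in ADDRESSES.items():
    off_good, off_bad = vram_offsets(good, bad)
    print(f"{name}: delta={bad - good:#x} "
          f"offset good={off_good:#x} bad={off_bad:#x}")
```

This only shows that the addresses are consistent with the relocated VRAM base. It does not by itself say why VCE misbehaves with the new layout.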
OK with current 4.17-wip

Good that we finally found the root cause and thanks for testing.