Bug 104748 - GPU hang occurs when running multi-encoders(h264) on haswell platform
Summary: GPU hang occurs when running multi-encoders(h264) on haswell platform
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: high blocker
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Duplicates: 104747
Reported: 2018-01-23 01:48 UTC by zhoubo
Modified: 2018-04-25 11:06 UTC

See Also:
i915 platform: HSW
i915 features: GPU hang


Attachments
dmesg log and i915_error_state (48.86 KB, application/vnd.rar)
2018-01-23 01:49 UTC, zhoubo

Description zhoubo 2018-01-23 01:48:00 UTC
[   25.271395] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[   38.785285] [drm] GPU HANG: ecode 7:0:0x8edcfff1, in TSK_VEncode4 [1674], reason: Hang on render ring, action: reset
[   38.817146] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   38.848304] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[   38.879736] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[   38.912668] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   38.944964] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[   44.655797] EXT4-fs (ram0): re-mounted. Opts: (null)
[   47.697675] drm/i915: Resetting chip after gpu hang
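
For triage, the hang report line in the log above can be split into its fields. The following is a minimal sketch, not part of the original report: the regular expression is written only against the single line shown here, and the field names `gen` and `engine` are an assumed reading of the first two numbers in the ecode.

```python
import re

# Pattern written against the one "GPU HANG" line in this report.
# The names "gen" and "engine" are assumptions about the ecode prefix.
HANG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+\[drm\] GPU HANG: "
    r"ecode (?P<gen>\d+):(?P<engine>\d+):(?P<ecode>0x[0-9a-fA-F]+), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line):
    """Return the fields of a 'GPU HANG' dmesg line as a dict, or None."""
    match = HANG_RE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["time"] = float(fields["time"])
    for key in ("gen", "engine", "pid"):
        fields[key] = int(fields[key])
    fields["ecode"] = int(fields["ecode"], 16)
    return fields


line = ("[   38.785285] [drm] GPU HANG: ecode 7:0:0x8edcfff1, in TSK_VEncode4 "
        "[1674], reason: Hang on render ring, action: reset")
info = parse_gpu_hang(line)
print(info["comm"], info["pid"], hex(info["ecode"]), info["reason"])
```

Running this over a full dmesg capture, one line at a time, would list every hang together with the process that triggered it.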
Comment 1 zhoubo 2018-01-23 01:49:13 UTC
Created attachment 136914
dmesg log and i915_error_state
Comment 2 zhoubo 2018-01-23 01:54:49 UTC
The kernel version info is "Linux haswell 4.8.0haswell #16 SMP Wed Nov 15 15:44:20 CST 2017 x86_64 GNU/Linux".
Comment 3 zhoubo 2018-01-23 01:58:56 UTC
The kernel version info is "Linux haswell 4.8.0haswell #16 SMP Wed Nov 15 15:44:20 CST 2017 x86_64 GNU/Linux".
Comment 4 zhoubo 2018-01-23 02:00:49 UTC
VAAPI version is 1.8.3
libdrm version is 2.4.81
intel-vaapi-driver version is 1.8.3
Comment 5 Elizabeth 2018-01-23 18:37:45 UTC
*** Bug 104747 has been marked as a duplicate of this bug. ***
Comment 6 Elizabeth 2018-01-24 18:43:05 UTC
(In reply to Elizabeth from comment #5)
> *** Bug 104747 has been marked as a duplicate of this bug. ***

(In reply to zhoubo from comment #3)
> (In reply to Elizabeth from comment #1)
> > Hello Zhoubo. If reproducible, could you try a more recent kernel
> > https://www.kernel.org? Thanks.
> 
> I found that the cause of the GPU hang may be the encoder rate control mode.
> First I chose VBR mode, and a GPU hang occurred within 10 minutes at most.
> Then I chose CBR mode, and the GPU hang did not occur again.
> I found some differences in the i965 driver, but I have not confirmed which
> one is the bug.
> 
>   START: 0x00312000
>   HEAD:  0x02006e30
>   TAIL:  0x00008848
>   CTL:   0x0001f001
>   HWS:   0x00311000
>   ACTHD: 0x00000000 6a934b44
>   IPEIR: 0x00000000
>   IPEHR: 0x71000007
>   INSTDONE: 0xffdcffff
>   BBADDR: 0x00000000 6a934b45
>   BB_STATE: 0x00000120
>   INSTPS: 0x80000208
>   INSTPM: 0x00006080
>   FADDR: 0x00000000 6a934d00
> 
> According to the GPU hang info, "IPEHR: 0x71000007" identifies the command
> the GPU was executing when it hung. I suspect a parameter error in the function
> "gen75_mfc_batchbuffer_emit_object_command" or
> "gen75_vme_fill_vme_batchbuffer", because of the statement "*command_ptr++ =
> (CMD_MEDIA_OBJECT | (9 - 2))".
> 
> So I think some parameter differs between CBR and VBR, which leads to a
> different "CMD_MEDIA_OBJECT" command and finally to the GPU hang.
> Is this right?
> If so, could you help me find where the bug is?

Hello Zhoubo, I believe both bugs have the same root cause, which is why I marked them as duplicates. According to your logs, you're using a 4.8 kernel while the current stable release is 4.14+, so the issue may already be fixed by a recent commit. Could you please try 4.14+ or drm-tip kernels? You may also want to visit https://01.org/linuxgraphics/community and try asking the developer community on IRC about your issue.
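
For reference, the IPEHR value quoted above can be checked against the command the reporter suspects. This is a minimal sketch, assuming the command header layout of the `CMD(pipeline, op, sub_op)` macro and the `CMD_MEDIA_OBJECT` definition in intel-vaapi-driver's i965_defines.h; both are reproduced from memory here and should be verified against the driver version actually in use. The helper names `cmd` and `decode_header` are hypothetical.

```python
def cmd(pipeline, op, sub_op):
    """Build a command header (assumed layout of the driver's CMD macro)."""
    return (3 << 29) | (pipeline << 27) | (op << 24) | (sub_op << 16)


# Assumed definition; check against i965_defines.h in your driver tree.
CMD_MEDIA_OBJECT = cmd(2, 1, 0)


def decode_header(dword):
    """Split a command header dword into the fields assumed above."""
    return {
        "type": (dword >> 29) & 0x7,
        "pipeline": (dword >> 27) & 0x3,
        "op": (dword >> 24) & 0x7,
        "sub_op": (dword >> 16) & 0xFF,
        "length_field": dword & 0xFFFF,
    }


ipehr = 0x71000007  # value from the error state quoted above
fields = decode_header(ipehr)
print(fields)
print(ipehr == (CMD_MEDIA_OBJECT | (9 - 2)))
```

Under these assumptions the header matches `CMD_MEDIA_OBJECT | (9 - 2)`, which is consistent with the reporter's reading. It only shows which command type was in flight, not which parameter in it is wrong.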
Comment 7 zhoubo 2018-01-25 07:27:59 UTC
Elizabeth,
   Thanks for your help. We are testing with 4.14.15 to see whether the bug is fixed.
   However, even if it works, updating the kernel is a high risk for us because our product is close to release. Could you give me a patch or the commit log that fixes the bug?
Comment 8 Elizabeth 2018-01-25 17:43:35 UTC
(In reply to zhoubo from comment #7)
> Elizabeth,
>    Thanks for your help. We are testing with 4.14.15 to see whether the bug
> is fixed.
>    However, even if it works, updating the kernel is a high risk for us
> because our product is close to release. Could you give me a patch or the
> commit log that fixes the bug?
Hello again Zhoubo. 
In this case, if you have an agreement with Intel, you can raise the priority and speed this up by escalating the issue through the proper internal channel. Otherwise, if kernel 4.15 fixes the issue, I recommend comparing the code between 4.8 and 4.15 in the areas you suspect are affected, to see which changes fix it.
Thank you.
Comment 9 Jani Saarinen 2018-03-29 07:12:00 UTC
First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment on the bug so that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), could you also do that to see whether the issue is still present there? If you cannot see the issue there, please comment on the bug.
Comment 10 Jani Saarinen 2018-04-25 11:06:00 UTC
Closing; please re-open if the issue still exists.

