Bug 100098 - [SKL] HEVC GPU hang
Summary: [SKL] HEVC GPU hang
Status: CLOSED INVALID
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: high critical
Assignee: Sergei
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-03-07 11:31 UTC by Sergei
Modified: 2017-06-26 21:27 UTC
CC List: 1 user

See Also:
i915 platform: SKL
i915 features: GPU hang


Attachments
ZIP Arch of dmidecode.txt, modinfo_i915.txt and error files (71.35 KB, application/zip)
2017-03-07 11:31 UTC, Sergei

Description Sergei 2017-03-07 11:31:23 UTC
Created attachment 130111
ZIP Arch of dmidecode.txt, modinfo_i915.txt and error files

We see a GPU hang when doing HW HEVC encoding (with MediaServerStudio 2017 R2). The issue happens randomly, usually 3-6 minutes after the beginning of HEVC encoding (or transcoding to HEVC). The issue isn't seen with the AVC encoder.

CPU: Intel(R) Core(TM) i7-6822EQ CPU @ 2.00GHz
OS: CentOS 7 
uname -a: 3.10.0-327.el7.x86_64 #1 SMP Thu Nov 19 22:10:57 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
i915 is from the MSS package.

Command line:
./sample_multi_transcode -par smt.par  -timeout 18000
...
...............................................................................................................................................................................
[ERROR], sts=MFX_ERR_GPU_HANG(-21), PutBS, m_pmfxSession->SyncOperation failed at /home/lab_msdk/buildAgentDir/buildAgent_MediaSDK3/git/mdp_msdk-samples/samples/sample_multi_transcode/src/pipeline_transcode.cpp:1575

[ERROR], sts=MFX_ERR_GPU_HANG(-21), Transcode, PutBS failed at /home/lab_msdk/buildAgentDir/buildAgent_MediaSDK3/git/mdp_msdk-samples/samples/sample_multi_transcode/src/pipeline_transcode.cpp:1540
...
Common transcoding time is 551.229 sec
-------------------------------------------------------------------------------
*** session 0 FAILED (MFX_ERR_GPU_HANG) 217.337 sec, 3485 frames
-i::h264 out_1.h264 -o::h265 out_1.h265 -b 10000 

*** session 1 FAILED (MFX_ERR_GPU_HANG) 210.426 sec, 3486 frames
-i::h264 out_1.h264 -o::h265 out_1.h265 -b 10000 

*** session 2 FAILED (MFX_ERR_GPU_HANG) 551.228 sec, 17314 frames
-i::h264 out_2.h264 -o::h265 out_2.h265 -b 10000 

*** session 3 FAILED (MFX_ERR_GPU_HANG) 217.329 sec, 3485 frames
-i::h264 out_2.h264 -o::h265 out_2.h265 -b 10000 

*** session 4 FAILED (MFX_ERR_GPU_HANG) 551.228 sec, 17314 frames
-i::h264 out_3.h264 -o::h265 out_3.h265 -b 10000 
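
To compare how long each session ran before failing, the summary lines above can be tabulated with a short script. This is only a sketch: the regular expression assumes the exact `*** session N RESULT (STATUS) T sec, F frames` layout shown in this output, and `parse_sessions` and `SAMPLE` are hypothetical names, not part of the Media SDK samples.

```python
import re

# Matches the per-session summary lines as they appear in the output above.
# Assumption: other result types use the same layout; only this one is known.
SESSION_RE = re.compile(
    r"\*\*\* session (\d+) (\w+) \((\w+)\) ([\d.]+) sec, (\d+) frames"
)


def parse_sessions(text):
    """Return one dict per session summary line found in text."""
    sessions = []
    for m in SESSION_RE.finditer(text):
        seconds = float(m.group(4))
        frames = int(m.group(5))
        sessions.append({
            "session": int(m.group(1)),
            "result": m.group(2),
            "status": m.group(3),
            "seconds": seconds,
            "frames": frames,
            "fps": frames / seconds,
        })
    return sessions


# Two of the summary lines from this report, copied verbatim.
SAMPLE = """\
*** session 0 FAILED (MFX_ERR_GPU_HANG) 217.337 sec, 3485 frames
*** session 2 FAILED (MFX_ERR_GPU_HANG) 551.228 sec, 17314 frames
"""

if __name__ == "__main__":
    for s in parse_sessions(SAMPLE):
        print(f"session {s['session']}: {s['status']} after "
              f"{s['seconds'] / 60:.1f} min, {s['fps']:.1f} frames/s")
```

Feeding the full console output to `parse_sessions` gives the run time of each session in one place, which makes it easier to see whether the failures cluster around the same point.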

dmesg | grep drm
[    1.035529] drm_ukmd_compat: module verification failed: signature and/or required key missing - tainting kernel
[    1.036455] Initialized drm/i915 compat module 20161215-16.5.1-59511-k75a71d9
[    1.039089] [drm] Initialized drm 1.1.0 20060810
[    1.045545] [drm_ukmd] Initialized drm_ukmd module
[    1.227096] [drm_ukmd] Memory usable by graphics device = 4096M
[    1.227102] fb: conflicting fb hw usage inteldrmfb vs EFI VGA - removing generic driver
[    1.227230] [drm_ukmd] Replacing VGA console driver
[    1.234477] [drm_ukmd] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    1.234478] [drm_ukmd] Driver supports precise vblank timestamp query.
[    1.234551] [drm_ukmd:i915_gem_init_stolen [i915]] *ERROR* conflict detected with stolen region: [0x8e000000 - 0x90000000]
[    1.961398] [drm_ukmd] RC6 disabled, disabling runtime PM support
[    1.961402] [drm_ukmd] Initialized i915 1.6.0 20161215-16.5.1-59511-k686851e for 0000:00:02.0 on minor 0
[    2.175155] fbcon: inteldrmfb (fb0) is primary device
[    2.419227] WARNING: at ../../../../qb/workspace/17023/p4gen/gfx_Development/builds/centos/_rpmbuild_tmp/BUILD/ukmd-16.5.1/drivers/gpu/drm/i915/intel_pm.c:3597 skl_update_other_pipe_wm+0x217/0x230 [i915]()
[    2.419241] Modules linked in: sd_mod crc_t10dif crct10dif_generic i915(OE) i2c_algo_bit drm_ukmd_kms_helper(OE) syscopyarea sysfillrect sysimgblt ahci fb_sys_fops e1000e libahci crct10dif_pclmul crct10dif_common crc32c_intel libata ptp drm_ukmd(OE) serio_raw pps_core drm(OE) drm_ukmd_compat(OE) video i2c_hid i2c_core dm_mirror dm_region_hash dm_log dm_mod
[    2.419399]  [<ffffffffa00f6be7>] _ukmd_drm_atomic_commit+0x37/0x60 [drm_ukmd]
[    2.419404]  [<ffffffffa0235f98>] restore_fbdev_mode+0x248/0x280 [drm_ukmd_kms_helper]
[    2.419409]  [<ffffffffa02381f3>] _ukmd_drm_fb_helper_restore_fbdev_mode_unlocked+0x33/0x80 [drm_ukmd_kms_helper]
[    2.419413]  [<ffffffffa023826c>] _ukmd_drm_fb_helper_set_par+0x2c/0x60 [drm_ukmd_kms_helper]
[    2.419474]  [<ffffffffa023853c>] _ukmd_drm_fb_helper_initial_config+0x29c/0x3f0 [drm_ukmd_kms_helper]
[    2.696982] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[    3.730385] [drm_ukmd] RC6 off
[    3.732473] [drm_ukmd] The Ring/GT multiplier is 2
[    4.060722] SELinux: initialized (dev drm, type drm), not configured for labeling
[  601.278623] WARNING: at ../../../../qb/workspace/17023/p4gen/gfx_Development/builds/centos/_rpmbuild_tmp/BUILD/ukmd-16.5.1/drivers/gpu/drm/i915/intel_pm.c:3597 skl_update_other_pipe_wm+0x217/0x230 [i915]()
[  601.278676] Modules linked in: snd_hda_codec_hdmi vfat fat snd_hda_codec_realtek snd_hda_codec_generic intel_powerclamp coretemp intel_rapl snd_hda_intel kvm_intel kvm snd_hda_codec snd_hda_core snd_hwdep crc32_pclmul snd_seq ghash_clmulni_intel snd_seq_device snd_pcm aesni_intel lrw gf128mul glue_helper ppdev ablk_helper snd_timer cryptd snd soundcore sg cdc_acm pcspkr i2c_i801 parport_pc parport shpchp acpi_pad acpi_cpufreq ip_tables xfs libcrc32c hid_multitouch sd_mod crc_t10dif crct10dif_generic i915(OE) i2c_algo_bit drm_ukmd_kms_helper(OE) syscopyarea sysfillrect sysimgblt ahci fb_sys_fops e1000e libahci crct10dif_pclmul crct10dif_common crc32c_intel libata ptp drm_ukmd(OE) serio_raw pps_core drm(OE) drm_ukmd_compat(OE) video i2c_hid i2c_core dm_mirror dm_region_hash dm_log dm_mod
[  601.278907]  [<ffffffffa00f6be7>] _ukmd_drm_atomic_commit+0x37/0x60 [drm_ukmd]
[  601.278922]  [<ffffffffa0234c2c>] _ukmd_drm_atomic_helper_connector_dpms+0xfc/0x1b0 [drm_ukmd_kms_helper]
[  601.278935]  [<ffffffffa0237020>] drm_fb_helper_dpms.isra.9+0xa0/0xe0 [drm_ukmd_kms_helper]
[  601.278946]  [<ffffffffa0237099>] _ukmd_drm_fb_helper_blank+0x39/0xa0 [drm_ukmd_kms_helper]

GPU hang:
[ 2579.435226] [drm_ukmd] stuck on bsd ring
[ 2579.435671] [drm_ukmd] GPU HANG: ecode 9:1:0xc85efffe, in sample_multi_tr [2500], reason: Ring hung, action: reset
[ 2579.435674] [drm_ukmd] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 2579.435676] [drm_ukmd] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 2579.435679] [drm_ukmd] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 2579.435680] [drm_ukmd] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 2579.435683] [drm_ukmd] GPU crash dump saved to /sys/class/drm/card0/error
[ 2579.437919] drm/i915: Resetting chip after gpu hang
[ 2581.435005] [drm_ukmd] RC6 off
[ 2581.435047] [drm_ukmd] The Ring/GT multiplier is 2
[ 4187.041260] SELinux: initialized (dev tmpfs, type tmpfs), uses transition SIDs
[ 4269.242216] [drm_ukmd] stuck on bsd ring
[ 4269.242657] [drm_ukmd] GPU HANG: ecode 9:1:0xc85efffe, in sample_multi_tr [2516], reason: Ring hung, action: reset
[ 4269.244830] drm/i915: Resetting chip after gpu hang
[ 4271.241998] [drm_ukmd] RC6 off
[ 4271.242037] [drm_ukmd] The Ring/GT multiplier is 2
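
For longer runs it can help to pull just the hang events out of dmesg. The following is a minimal sketch, assuming the `GPU HANG: ecode ..., in COMM [PID], reason: ..., action: ...` line format seen in the log above; `parse_hangs` and `SAMPLE` are hypothetical names, and the ecode is kept as an opaque string rather than interpreted.

```python
import re

# Matches the "GPU HANG" dmesg lines as they appear in this report.
# The driver tag is matched loosely ([drm_ukmd] here) since it may differ
# between driver builds; that flexibility is an assumption.
HANG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+\[\w+\]\s+"
    r"GPU HANG: ecode (?P<ecode>[^,\s]+), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)


def parse_hangs(text):
    """Return one dict per GPU HANG line found in dmesg text."""
    hangs = []
    for m in HANG_RE.finditer(text):
        hangs.append({
            "time": float(m.group("time")),  # seconds since boot
            "ecode": m.group("ecode"),
            "comm": m.group("comm"),
            "pid": int(m.group("pid")),
            "reason": m.group("reason"),
            "action": m.group("action"),
        })
    return hangs


# The two hang lines from this report, copied verbatim.
SAMPLE = """\
[ 2579.435671] [drm_ukmd] GPU HANG: ecode 9:1:0xc85efffe, in sample_multi_tr [2500], reason: Ring hung, action: reset
[ 4269.242657] [drm_ukmd] GPU HANG: ecode 9:1:0xc85efffe, in sample_multi_tr [2516], reason: Ring hung, action: reset
"""

if __name__ == "__main__":
    for h in parse_hangs(SAMPLE):
        print(f"{h['time']:.3f}s  {h['comm']}[{h['pid']}]  "
              f"ecode={h['ecode']}  reason={h['reason']}")
```

Both hangs here carry the same ecode and reason, which the parsed records make easy to check across many runs.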

Dmidecode is attached.
/sys/class/drm/card0/error file is attached.
modinfo i915 is attached.
Comment 1 Chris Wilson 2017-03-07 11:51:12 UTC
3.10.0-327.el7 is a very unknown quantity. We know we have pushed required workarounds into the kernel, but we do not know if they have all been backported to that kernel. As a starting point, please reproduce on an upstream kernel, preferably: https://cgit.freedesktop.org/drm-tip
Comment 2 Ricardo 2017-05-09 17:59:40 UTC
Assigning the bug to the submitter to try to reproduce it in the latest configuration. If there is no response within 30 days, the bug will be closed for lack of information.

If the problem persists in the newest configuration, please update with logs and also change the status to REOPEN.

If the problem is no longer present, please resolve this issue.
Comment 3 Elizabeth 2017-06-26 21:27:25 UTC
(In reply to Ricardo from comment #2)
> Assigning the bug to the submitter to try to reproduce it in the latest
> configuration. If there is no response within 30 days, the bug will be
> closed for lack of information.
> 
> If the problem persists in the newest configuration, please update with
> logs and also change the status to REOPEN.
> 
> If the problem is no longer present, please resolve this issue.

Closing bug.

