Summary: | [SKL GT4e] 3D game nexuiz 1.6.1 causes GPU HANG | |
---|---|---|---
Product: | DRI | Reporter: | binx.wu
Component: | DRM/Intel | Assignee: | mwa <matthew.auld>
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | critical | |
Priority: | highest | CC: | gordon.jin, intel-gfx-bugs, knikkane, terrence.xu
Version: | DRI git | |
Hardware: | x86-64 (AMD64) | |
OS: | Linux (All) | |
Whiteboard: | | |
i915 platform: | SKL | i915 features: | GPU hang
Description
binx.wu
2016-05-25 06:07:51 UTC
Created attachment 124070 [details]
/sys/class/drm/card0/error
file of /sys/class/drm/card0/error
Created attachment 124099 [details]
dmesg with drm.debug=0x0e
Sorry, I forgot to upload this log.
I really have no idea, but there seems to be, e.g., this workaround still pending; please try it:
http://patchwork.freedesktop.org/patch/msgid/1465816501-25557-1-git-send-email-tim.gore@intel.com

Also, the dmesg is missing the beginning, including the i915 device info parts.

Could you please try with the latest drm-intel-nightly from git://anongit.freedesktop.org/drm-intel ? Some SKL-specific workarounds had their scope extended to newer revisions just recently.

Created attachment 124603 [details]
/var/log/kern.log file

commit 3aaddcdb189ded5595700cf07e2e0991bfb812c7
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Sun Jun 19 10:40:16 2016 +0200

    drm-intel-nightly: 2016y-06m-19d-08h-39m-57s UTC integration manifest

The issue still exists:

[ 488.731857] [drm] RC6 on
[ 496.732046] [drm] stuck on render ring
[ 496.732241] [drm] GPU HANG: ecode 9:0:0x85dffffb, in nexuiz-linux-x8 [3325], reason: Engine(s) hung, action: reset
[ 496.732290] [drm:i915_reset_and_wakeup] resetting chip
[ 496.734046] drm/i915: Resetting chip after gpu hang
[ 496.734052] [drm:gen8_init_common_ring] Execlists enabled for render ring
[ 496.734070] [drm:gen8_init_common_ring] Execlists enabled for blitter ring
[ 496.734085] [drm:gen8_init_common_ring] Execlists enabled for bsd ring
[ 496.734099] [drm:gen8_init_common_ring] Execlists enabled for bsd2 ring
[ 496.734114] [drm:gen8_init_common_ring] Execlists enabled for video enhancement ring
[ 496.734135] [drm:intel_guc_setup] GuC fw status: path i915/skl_guc_ver6_1.bin, fetch FAIL, load NONE
[ 496.734138] [drm] GuC firmware load failed: -5

# glxinfo |grep "renderer string"
OpenGL renderer string: Mesa DRI Intel(R) Iris Pro Graphics P580 (Skylake GT4e)

I have already added the full log from the /var/log/kern.log file.
OS: Ubuntu 16.04 desktop

Changing the status to assigned, since we can still reproduce it with the newest code from drm-intel-nightly and have provided the newest dmesg. Kuoppala, do you need any more information?
(In reply to Terrence Xu from comment #6)
> Change the status to assigned since we still can reproduce it in the newest
> code from drm-intel-nightly and provided the newest dmesg.

And the patch from comment #3?

(In reply to Jani Nikula from comment #7)
> (In reply to Terrence Xu from comment #6)
> > Change the status to assigned since we still can reproduce it in the newest
> > code from drm-intel-nightly and provided the newest dmesg.
>
> And the patch from comment #3?

Hi Nikula,
This patch already exists in the drm-intel-nightly branch:

commit a8ab5ed5e1bf856eceaab5579236de6f92822b9f
Author: Tim Gore <tim.gore@intel.com>
Date: Mon Jun 13 12:15:01 2016 +0100

    drm/i915/gen9: implement WaConextSwitchWithConcurrentTLBInvalidate

    This patch enables a workaround for a mid thread preemption issue where a hardware timing problem can prevent the context restore from happening, leading to a hang.

    v2: move to gen9_init_workarounds (Arun)
    v3: move to start of gen9_init_workarounds (Arun)

    Signed-off-by: Tim Gore <tim.gore@intel.com>
    Reviewed-by: Arun Siluvery <arun.siluvery@linux.intel.com>
    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/1465816501-25557-1-git-send-email-tim.gore@intel.com

We still hit the GPU hang issue:

[ 114.965619] [drm] stuck on render ring
[ 114.970172] [drm] GPU HANG: ecode 9:0:0xfffffffe, in nexuiz-linux-x8 [3252], reason: Engine(s) hung, action: reset
[ 114.981934] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 114.992402] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 115.002481] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 115.013447] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 115.023619] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 115.031127] [drm:i915_reset_and_wakeup] resetting chip
[ 115.039093] drm/i915: Resetting chip after gpu hang
[ 115.044679] [drm:gen8_init_common_ring] Execlists enabled for render ring
[ 115.052422] [drm:gen8_init_common_ring] Execlists enabled for blitter ring
[ 115.060257] [drm:gen8_init_common_ring] Execlists enabled for bsd ring
[ 115.067692] [drm:gen8_init_common_ring] Execlists enabled for bsd2 ring
[ 115.075227] [drm:gen8_init_common_ring] Execlists enabled for video enhancement ring
[ 115.084045] [drm:intel_guc_setup] GuC fw status: path i915/skl_guc_ver6_1.bin, fetch FAIL, load NONE
[ 115.094438] [drm] GuC firmware load failed: -5
[ 116.977464] [drm] RC6 on
[ 124.977716] [drm] stuck on render ring
[ 124.982275] [drm] GPU HANG: ecode 9:0:0xfffffffe, in nexuiz-linux-x8 [3252], reason: Engine(s) hung, action: reset
[ 124.994066] [drm:i915_reset_and_wakeup] resetting chip
[ 125.002035] drm/i915: Resetting chip after gpu hang
[ 125.007627] [drm:gen8_init_common_ring] Execlists enabled for render ring
[ 125.015376] [drm:gen8_init_common_ring] Execlists enabled for blitter ring
[ 125.023211] [drm:gen8_init_common_ring] Execlists enabled for bsd ring
[ 125.030652] [drm:gen8_init_common_ring] Execlists enabled for bsd2 ring
[ 125.038195] [drm:gen8_init_common_ring] Execlists enabled for video enhancement ring
[ 125.047033] [drm:intel_guc_setup] GuC fw status: path i915/skl_guc_ver6_1.bin, fetch FAIL, load NONE
[ 125.057420] [drm] GuC firmware load failed: -5
[ 125.126724] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.149208] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.165956] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.182720] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.199386] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.216057] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.232721] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.454492] [drm:skl_wm_flush_pipe] flush pipe A (pass 3)
[ 125.466015] DMAR: DRHD: handling fault status reg 3
[ 125.471564] DMAR: [DMA Read] Request device [00:02.0] fault addr f9827000 [fault reason 06] PTE Read access is not set
[ 125.472089] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.489039] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.505019] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.515840] DMAR: DRHD: handling fault status reg 3
[ 125.521408] DMAR: [DMA Read] Request device [00:02.0] fault addr f9827000 [fault reason 06] PTE Read access is not set
[ 125.549366] DMAR: DRHD: handling fault status reg 3
[ 125.554919] DMAR: [DMA Read] Request device [00:02.0] fault addr f9827000 [fault reason 06] PTE Read access is not set
[ 125.582693] DMAR: DRHD: handling fault status reg 3
[ 125.588257] DMAR: [DMA Read] Request device [00:02.0] fault addr f9827000 [fault reason 06] PTE Read access is not set
[ 125.608688] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.619399] DMAR: DRHD: handling fault status reg 3
[ 125.619401] DMAR: [DMA Read] Request device [00:02.0] fault addr f9827000 [fault reason 06] PTE Read access is not set
[ 125.637102] DMAR: DRHD: handling fault status reg 3
[ 125.637104] DMAR: [DMA Read] Request device [00:02.0] fault addr f9827000 [fault reason 06] PTE Read access is not set
[ 125.654799] DMAR: DRHD: handling fault status reg 3
[ 125.654800] DMAR: [DMA Read] Request device [00:02.0] fault addr f9827000 [fault reason 06] PTE Read access is not set
[ 125.672491] DMAR: DRHD: handling fault status reg 3
[ 125.672492] DMAR: [DMA Read] Request device [00:02.0] fault addr f9827000 [fault reason 06] PTE Read access is not set
[ 125.690191] DMAR: DRHD: handling fault status reg 3
[ 125.690193] DMAR: [DMA Read] Request device [00:02.0] fault addr f9827000 [fault reason 06] PTE Read access is not set
[ 125.707890] DMAR: DRHD: handling fault status reg 3
[ 125.707891] DMAR: [DMA Read] Request device [00:02.0] fault addr f9827000 [fault reason 06] PTE Read access is not set
[ 125.735986] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.753193] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.768880] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.785399] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.802125] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.818760] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.835536] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 125.852006] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 126.297968] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 126.898794] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 126.965872] [drm] RC6 on
[ 127.500350] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 128.101283] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 128.702096] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 129.303064] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 129.904004] [drm:skl_update_scaler_plane] Updating scaler for [PLANE:23:plane 1A] scaler_user index 0.0
[ 130.482775] dmar_fault: 286 callbacks suppressed
[ 130.488056] DMAR: DRHD: handling fault status reg 3

Two dozen or so workarounds affecting skl/kbl went to nightly by the end of July. Could you please respin with the latest from git://anongit.freedesktop.org/drm-intel ?

I tried to see if I could reproduce this on a SKL GT4e running a recent -nightly, but with no luck.

Try to load with intel_iommu=igfx_off. Then reproduce 3 times and upload all error states here. Thanks

The bad news: we can still reproduce it with the latest drm-intel-nightly code, booted with intel_iommu=igfx_off, at the following commit:

commit 9561f5c5e1918cfaeb2a39f90eed046730ae7399
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Tue Jul 12 17:15:07 2016 +0200

    drm-intel-nightly: 2016y-07m-12d-15h-14m-43s UTC integration manifest

The corresponding attachments are 0713-1.log, 0713-2.log and 0713-3.log.

Created attachment 125046 [details]
0713-1.log
Created attachment 125047 [details]
0713-2.log
Created attachment 125048 [details]
0713-3.log
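For anyone comparing hangs across several logs like the ones attached here, the "GPU HANG" lines can be pulled out with a short script. The following is a minimal Python sketch and not part of the original report; the regular expression is written only against the line format visible in this thread, and the field names (`find_hangs`, `ecode`, `process`, and so on) are chosen here for illustration.

```python
import re

# Pattern written against the "GPU HANG" lines quoted in this report.
# The two numbers before the hex code are captured but not interpreted;
# their meaning is not spelled out anywhere in this thread.
HANG_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+\[drm\] GPU HANG: "
    r"ecode (?P<first>\d+):(?P<second>\d+):(?P<code>0x[0-9a-fA-F]+), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def find_hangs(dmesg_text):
    """Return one dict per 'GPU HANG' line found in dmesg_text."""
    hangs = []
    for m in HANG_RE.finditer(dmesg_text):
        hangs.append({
            "time": float(m.group("time")),
            "ecode": m.group("code"),
            "process": m.group("comm"),
            "pid": int(m.group("pid")),
            "reason": m.group("reason"),
            "action": m.group("action"),
        })
    return hangs


# Two lines copied from the logs in this report.
sample = (
    "[ 496.732241] [drm] GPU HANG: ecode 9:0:0x85dffffb, in nexuiz-linux-x8 [3325], "
    "reason: Engine(s) hung, action: reset\n"
    "[ 114.970172] [drm] GPU HANG: ecode 9:0:0xfffffffe, in nexuiz-linux-x8 [3252], "
    "reason: Engine(s) hung, action: reset\n"
)

for hang in find_hangs(sample):
    print(hang["time"], hang["ecode"], hang["process"], hang["reason"])
```

Feeding it the contents of each attached log in turn would give a quick list of which process hung, when, and with which error code.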
Hmm, so I managed to reproduce this, but only when I disable GuC submission, which by the look of your logs is also what is happening on your side, though there it is because the firmware can't be found and not because it has been intentionally disabled. Is there any particular reason why you are not using the GuC? Would you be able to test with the GuC loaded? Nevertheless, there does seem to be a bug when falling back to execlist mode on the SKL GT4e...

Hello mwa,
The bad news is that I also reproduced this issue after I downloaded the GuC firmware and enabled it in i915. The error log is below:

[ 252.697245] [drm] GPU HANG: ecode 9:0:0xfffffffe, in nexuiz-linux-x8 [2970], reason: Hang on render ring, action: reset
[ 252.697247] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 252.697248] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 252.697249] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 252.697250] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 252.697251] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 252.697272] [drm:i915_reset_and_wakeup] resetting chip
[ 252.697282] drm/i915: Resetting chip after gpu hang
[ 252.697316] [drm:gen8_init_common_ring] Execlists enabled for render ring
[ 252.697334] [drm:gen8_init_common_ring] Execlists enabled for blitter ring
[ 252.697349] [drm:gen8_init_common_ring] Execlists enabled for bsd ring
[ 252.697363] [drm:gen8_init_common_ring] Execlists enabled for bsd2 ring
[ 252.697378] [drm:gen8_init_common_ring] Execlists enabled for video enhancement ring
[ 252.697408] [drm:intel_guc_setup] GuC fw status: path i915/skl_guc_ver6_1.bin, fetch SUCCESS, load SUCCESS
[ 252.697411] [drm:intel_guc_setup] GuC fw status: fetch SUCCESS, load PENDING
[ 252.698532] [drm:guc_ucode_xfer_dma] DMA status 0x10, GuC status 0x8002f0ec
[ 252.698534] [drm:guc_ucode_xfer_dma] returning 0
[ 252.698536] [drm:intel_guc_setup] GuC fw status: fetch SUCCESS, load SUCCESS
[ 252.698550] [drm:select_doorbell_register] assigned normal priority doorbell id 0x0
[ 252.698551] [drm:select_doorbell_cacheline] selected doorbell cacheline 0x40, next 0x80, linesize 64
[ 252.698559] [drm:guc_client_alloc] new priority 2 client ffff8804898d9280: ctx_index 0
[ 252.698560] [drm:guc_client_alloc] doorbell id 0, cacheline offset 0x40
[ 254.696694] [drm] RC6 on
[ 262.695933] [drm:i915_reset_and_wakeup] resetting chip
[ 262.695944] drm/i915: Resetting chip after gpu hang
[ 262.697760] [drm:gen8_init_common_ring] Execlists enabled for render ring
[ 262.697787] [drm:gen8_init_common_ring] Execlists enabled for blitter ring
[ 262.697808] [drm:gen8_init_common_ring] Execlists enabled for bsd ring
[ 262.697828] [drm:gen8_init_common_ring] Execlists enabled for bsd2 ring
[ 262.697847] [drm:gen8_init_common_ring] Execlists enabled for video enhancement ring
[ 262.697882] [drm:intel_guc_setup] GuC fw status: path i915/skl_guc_ver6_1.bin, fetch SUCCESS, load SUCCESS
[ 262.697886] [drm:intel_guc_setup] GuC fw status: fetch SUCCESS, load PENDING
[ 262.701710] [drm:guc_ucode_xfer_dma] DMA status 0x10, GuC status 0x8002f0ec
[ 262.701713] [drm:guc_ucode_xfer_dma] returning 0
[ 262.701715] [drm:intel_guc_setup] GuC fw status: fetch SUCCESS, load SUCCESS
[ 262.701729] [drm:select_doorbell_register] assigned normal priority doorbell id 0x0
[ 262.701730] [drm:select_doorbell_cacheline] selected doorbell cacheline 0x80, next 0xc0, linesize 64
[ 262.703708] [drm:guc_client_alloc] new priority 2 client ffff8804898d9280: ctx_index 0
[ 262.703710] [drm:guc_client_alloc] doorbell id 0, cacheline offset 0x80

The GuC status is below:

root@igvt-1604:/sys/kernel/debug/dri/0# cat i915_guc_load_status
GuC firmware status:
    path: i915/skl_guc_ver6_1.bin
    fetch: SUCCESS
    load: SUCCESS
    version wanted: 6.1
    version found: 6.1
    header: offset is 0; size = 128
    uCode: offset is 128; size = 128640
    RSA: offset is 128768; size = 256
GuC status 0x800300ec:
    Bootrom status = 0x76
    uKernel status = 0x0
    MIA Core status = 0x3
Scratch registers:
    0: 0xf0000000
    1: 0x0
    2: 0x0
    3: 0x5f5e100
    4: 0x600
    5: 0x0
    6: 0x0
    7: 0x8
    8: 0x3
    9: 0xd4a00
    10: 0x0
    11: 0x0
    12: 0x0
    13: 0x0
    14: 0x0
    15: 0x0

Created attachment 125127 [details]
dmesg-with-guc-0718.log
Attaching the error dmesg with GuC enabled. BTW, the GuC version is 6.1; download address: https://01.org/zh/linuxgraphics/downloads/skylake-guc-6.1

Okay, so after *lots* more investigation, this does not look like a kernel issue. It would seem the user-space component which is the root cause of the hang is in fact Mesa. The good news is that it seems to have been fixed: I tested on the latest master (9c63224) and the hang doesn't seem to present itself. Would you be able to also confirm this and report back?

(In reply to mwa from comment #19)
> Okay, so after *lots* more investigation, this does not look like a kernel
> issue. It would seem the user-space component which is the root cause of the
> hang is in fact Mesa. The good news is that it seems to have been fixed, I
> tested on the latest master(9c63224) and the hang doesn't seem to present
> itself. Would you be able to also confirm this and report back?

You are right. :) The Ubuntu 16.04 default Mesa version is 11.2.0, and after I upgraded Mesa to master (9c63224, 12.1.0-devel), this issue disappeared, the same result as Bug 96177.

Resolving this issue, since it is fixed by upgrading to the latest Mesa.

Any idea what fixed it, and if so, will it be backported to 11.2.x/12.0.x?

We have confirmed that this issue disappeared on both Ubuntu and CentOS after upgrading Mesa.

For reference, the fix is:

commit ddcfc35f62ed3ad83b100beacb5b30394dcd9960
Author: Ben Widawsky <ben@bwidawsk.net>
Date: Thu May 26 11:04:07 2016 -0700

    i965/sklgt4: Implement depth/timestamp write w/a

    The stated bug describes a scenario in which a post sync write operation for depth or timestamp can be ignored. There are two workarounds suggested, the first and easier is to simply do a cs stall when we do these type of writes. The second option is to do a PIPE_CONTROL flush after the post sync but before the data is required.

    Generally, I believe the data written out is consumed by the application on the CPU side and so doing the easier of the two is ideal. Furthermore, these queries aren't tremendously common in the perf sensitive apps I have looked at. However, there could be cases where a shader stage might directly consume the data, and as a result option 2 may be desirable. This patch goes with the easier solution for now.

    gen9lp bug_de_id=2137196

    By itself, this does *not* fix any of the GT4 hangs we're currently experiencing.

    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Signed-off-by: Ben Widawsky <ben@bwidawsk.net>
    Reviewed-by: Anuj Phogat <anuj.phogat@gmail.com>

There are plans to backport the fix to the next Mesa stable release (should be 11.2.3).
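To make the first option from the commit message concrete (add a CS stall whenever a post-sync depth or timestamp write is emitted on SKL GT4), here is a small Python model of that decision. It is only a sketch of the logic as described in the commit text, not Mesa's actual code: the `PipeControl` flag names and values and the `apply_post_sync_workaround` helper are hypothetical and do not correspond to the real PIPE_CONTROL bit layout.

```python
from enum import IntFlag


class PipeControl(IntFlag):
    # Hypothetical flag values for illustration only; these are not the
    # hardware's or Mesa's real PIPE_CONTROL bits.
    CS_STALL = 1
    WRITE_DEPTH_COUNT = 2
    WRITE_TIMESTAMP = 4
    WRITE_IMMEDIATE = 8


def apply_post_sync_workaround(flags, is_skl_gt4):
    """Model of option 1: stall the command streamer on depth/timestamp writes."""
    affected = PipeControl.WRITE_DEPTH_COUNT | PipeControl.WRITE_TIMESTAMP
    if is_skl_gt4 and (flags & affected):
        # The post-sync write could otherwise be ignored, per the commit message.
        flags |= PipeControl.CS_STALL
    return flags


# A timestamp write on SKL GT4 gains the stall; other cases are left untouched.
print(apply_post_sync_workaround(PipeControl.WRITE_TIMESTAMP, True))
print(apply_post_sync_workaround(PipeControl.WRITE_TIMESTAMP, False))
print(apply_post_sync_workaround(PipeControl.WRITE_IMMEDIATE, True))
```

The second option mentioned in the commit (a separate PIPE_CONTROL flush after the post-sync write, before the data is consumed) is not modeled here.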