Summary: | GPU Hang with single-channel RAM configuration | |
---|---|---|---
Product: | DRI | Reporter: | Mark Janes <mark.a.janes>
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | critical | |
Priority: | medium | CC: | baker.dylan.c, clayton.a.craft, intel-gfx-bugs, mark.a.janes, martin.peres, mika.kuoppala, volodymyr.los
Version: | unspecified | |
Hardware: | x86-64 (AMD64) | |
OS: | Linux (All) | |
Whiteboard: | | |
i915 platform: | ALL | i915 features: | GPU hang
Attachments: | | |
Description
Mark Janes
2018-02-12 21:20:46 UTC
Created attachment 137306
card error state
It completed the CS interrupt following seqno 0x00bd22f7, but the context switch (following ELSP) did not restart on the new context. ~o~ I'm not sure if a bisect will reveal anything more than a timing change.

Initial guess is that this issue is related to the one fixed in

commit ba74cb10c775c839f6e1d0fabd1e772eabd9c43f
Author: Michel Thierry <michel.thierry@intel.com>
Date: Mon Nov 20 12:34:58 2017 +0000

    drm/i915/execlists: Delay writing to ELSP until HW has processed the previous write

Chris, do you think you can write a test case or a patch I can try?

From the error-state, I would say it just takes stress; you just need to hit the right timing between CS interrupt and ELSP submission. If that is the case, igt should be able to trigger it indirectly. My assumption may well be off, but I would start with just running texture-gather-offset (or whichever piglit test got caught in that GPU hang) in a loop, probably with a background "find / -type f -exec cat {} \;"

In my initial investigation, I ran the texture-gather tests in loops on all cores simultaneously, and couldn't make it fail. Running the full piglit suite triggered the hang maybe 40% of the time.

I can reproduce this about 50% of the time running a full piglit suite on SKL as well in a single channel configuration, so I'm going to change this to all hardware.

GFX CI fi-skl-6700k2 now has one DIMM in one channel, for 8GB memory. The change happened after CI_DRM_3760.

Tomi: so far, we have been reproducing this with 2 sticks of RAM in the same channel. Martin: is it possible to bisect this with EZBench?

(In reply to Mark Janes from comment #8)
> Martin: is it possible to bisect this with EZBench?

It is possible, but it requires writing a custom bisecting job, which would anyway amount to what git bisect is doing, so there is no real point in using ezbench for that :s I am absolutely swamped with the 3x4K bug and the deployment of cibuglog-ng, so not sure I can help you.
If you have a clear reproducing case (run one test even 10 times), then no development is necessary and I can reproduce that.

I tested this on two available platforms: Haswell and Kabylake. For Haswell I did 3 full piglit runs and didn't reproduce any GPU hangs (4.15.3-041503-generic was used for testing). But for Kabylake with the same 4.15.3 kernel I got the following error (100% reproducible; the output of two concurrent test processes is interleaved):

fail: spec/glsl-1.20/execution/uniform-initializer/fs-mat2-array
running: spec/glsl-1.20/execution/uniform-initializer/vs-mat3-set-by-api
running: spec/glsl-1.20/execution/uniform-initializer/fs-bool-from-const
Traceback (most recent call last):3, warn: 2, fail: 70, crash: 3 -||/
  File "/home/vadym/mesa/piglit_original/piglit/framework/test/base.py", line 205, in execute
    self.run()
  File "/home/vadym/mesa/piglit_original/piglit/framework/test/base.py", line 279, in run
    self._run_command()
  File "/home/vadym/mesa/piglit_original/piglit/framework/test/base.py", line 357, in _run_command
Traceback (most recent call last):
  File "/home/vadym/mesa/piglit_original/piglit/framework/test/base.py", line 205, in execute
    self.run()
  File "/home/vadym/mesa/piglit_original/piglit/framework/test/base.py", line 279, in run
    raise e
    self._run_command()
  File "/home/vadym/mesa/piglit_original/piglit/framework/test/base.py", line 357, in _run_command
OSError: [Errno 12] Cannot allocate memory
    raise e
OSError: [Errno 12] Cannot allocate memory
fail: spec/glsl-1.20/execution/uniform-initializer/vs-mat3-set-by-api
fail: spec/glsl-1.20/execution/uniform-initializer/fs-bool-from-const
running: spec/glsl-1.20/execution/uniform-initializer/fs-float-set-by-other-stage
running: spec/glsl-1.20/execution/uniform-initializer/fs-mat4
Killed/54272] skip: 178, pass: 5853, warn: 2, fail: 72, crash: 3 -|-\|

I noticed that at this point no free memory is available on my laptop (~16 GB are allocated). I'm getting this error with the swap area disabled.
But with the swap area enabled, Linux hangs completely and only a reset helps. With kernel 4.9 this issue is not reproducible.

Kabylake configs:
Platform: Lenovo YOGA 520
CPU: Intel® Core™ i7-8550U CPU @ 1.80GHz × 8
GPU: Intel® UHD Graphics 620 (Kabylake GT2)
RAM: 16GB
OS: Ubuntu 16.04 LTS 64-bit
Mesa: 18.1.0-devel (git-fa901768a4)
Kernel: 4.15.3-041503-generic
Piglit: git-4210d072f

Also observed intermittent test failures for KHR-GLES31.core.tessellation_shader.tessellation_control_to_tessellation_evaluation.gl_tessLevel on HSW. When compared to the HSW systems that do not show this intermittent failure, the main difference is that the failing systems have 1 DIMM installed (8GB in channel A bank 0) and the non-failing systems have 2 DIMMs installed (4GB each in channel A & B bank 0). In this case, the failing systems are running kernel 4.9, and the non-failing systems are running either 4.9 or 4.15.

Platform: HP Z220 SFF Workstation
SKU: ASJ45AV
CPU: i5-3470 @ 3.1 GHz x 4, stepping: 000306A9 00000019
RAM: DDR3 1600MHz (used several options in 2 channels and 4 dimms (1 or 2 memunits x 2Gb))
System BIOS: K51 v01.68
Firmware ver: 8.0.4.1441
OS: Ubuntu 16.04(.4) LTS 64-bit
Mesa: 17.2.8
Kernels used: 4.9.x, 4.13.x, 4.15.x
Piglit: git-b8e7cc0e59

No GPU hangs were reproduced and reported. There were many combinations in different slots with different memory units. Used the “all” option in the piglit runs.

First of all, sorry about the spam. This is a mass update for our bugs. Sorry if you find this annoying, but we are trying to understand whether this bug is still valid. If the bug investigation is still in progress, please ignore this, and I apologize! If you think this bug is no longer valid, please comment on the bug so that it can be closed. If you haven't tested with our latest pre-upstream tree (drm-tip), can you also do that to see whether the issue is still valid there? If you cannot see the issue there, please comment on the bug.

yes, this bug is still valid.
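Because the hang is intermittent (roughly 40% to 50% of full piglit runs, according to the earlier comments), a handful of clean runs does not by itself show that a kernel is unaffected. The following is a minimal sketch of that arithmetic, assuming each full piglit run is independent and hangs with a fixed probability; the function names are hypothetical and not part of piglit or any CI tooling.

```python
import math


def prob_all_clean(hang_rate, runs):
    """Chance that `runs` independent runs all pass on an affected kernel.

    hang_rate is the assumed per-run probability of a GPU hang (0 < hang_rate < 1).
    """
    return (1.0 - hang_rate) ** runs


def runs_needed(hang_rate, confidence=0.99):
    """Smallest number of consecutive clean runs such that an affected kernel
    would have hung at least once with probability >= confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - hang_rate))


if __name__ == "__main__":
    # Per-run hang rates taken from the reports in this thread.
    for rate in (0.4, 0.5):
        print(f"hang rate {rate:.0%}: "
              f"{runs_needed(rate)} clean runs for 99% confidence, "
              f"P(10 clean runs by luck) = {prob_all_clean(rate, 10):.4f}")
```

Under these assumptions, ten consecutive clean runs leave less than a 1% chance that an affected kernel simply got lucky, whereas two or three clean runs say very little.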
Mika, Chris, any advice here? We could try to repro on fi-skl-6700k2

Mika, do you know if this was tried on that?

Can you please apply https://patchwork.freedesktop.org/series/42867/ and see if that makes a difference?

Chris, I had to build 4.16 in order for your patch to apply (it doesn't apply to 4.15, which was used originally to hit this issue). But I cannot reproduce the issue on a vanilla 4.16 kernel *without* your patch. 10 full runs of piglit did *not* cause a gpu hang on a skl system running with single channel RAM config. If I roll back to 4.15, I can reproduce the gpu hang on the same system after 1-2 piglit runs. Based on my testing, it seems like this issue is magically resolved by some change in 4.16 from 4.15 (where the issue is hit).

Based on comments, resolving. Please reopen if it occurs again.

Likely fix:

commit ba74cb10c775c839f6e1d0fabd1e772eabd9c43f
Author: Michel Thierry <michel.thierry@intel.com>
Date: Mon Nov 20 12:34:58 2017 +0000

    drm/i915/execlists: Delay writing to ELSP until HW has processed the previous write

    The hardware needs some time to process the information received in the ExecList Submission Port, and expects us to not write anything more until it has 'acknowledged' this new submission by sending an IDLE_ACTIVE or PREEMPTED CSB event.

    If we do not follow this, the driver could write new data into the ELSP before HW had finishing fetching the previous one, putting us in 'undefined behaviour' space.

    This seems to be the problem causing the spurious PREEMPTED & COMPLETE events after a COMPLETE like the one below:

    [] vcs0: sw rd pointer = 2, hw wr pointer = 0, current 'head' = 3.
    [] vcs0: Execlist CSB[0]: 0x00000018 _ 0x00000007
    [] vcs0: Execlist CSB[1]: 0x00000001 _ 0x00000000
    [] vcs0: Execlist CSB[2]: 0x00000018 _ 0x00000007 <<< COMPLETE
    [] vcs0: Execlist CSB[3]: 0x00000012 _ 0x00000007 <<< PREEMPTED & COMPLETE
    [] vcs0: Execlist CSB[4]: 0x00008002 _ 0x00000006
    [] vcs0: Execlist CSB[5]: 0x00000014 _ 0x00000006

    The ELSP writes that lead to this CSB sequence show that the HW hadn't started executing the previous execlist (the one with only ctx 0x6) by the time the new one was submitted; this is a bit more clear in the data show in the EXECLIST_STATUS register at the time of the ELSP write.

    [] vcs0: ELSP[0] = 0x0_0 [execlist1] - status_reg = 0x0_302
    [] vcs0: ELSP[1] = 0x6_fedb2119 [execlist0] - status_reg = 0x0_8302
    [] vcs0: ELSP[2] = 0x7_fedaf119 [execlist1] - status_reg = 0x0_8308
    [] vcs0: ELSP[3] = 0x6_fedb2119 [execlist0] - status_reg = 0x7_8308

    Note that having to wait for this ack does not disable lite-restores, although it may reduce their numbers.

    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=102035
    Signed-off-by: Michel Thierry <michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/<20171118003038.7935-1-michel.thierry@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171120123458.23242-4-chris@chris-wilson.co.uk
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Tested-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

which was asked to be tested in c3.

Well, that explains why it wouldn't reproduce yesterday on 4.16, when we went to test Chris's patch.
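The CSB status words in the commit message quoted above are easier to follow when split into their individual flag bits. The sketch below does this, assuming the bit layout of the GEN8_CTX_STATUS_* flags as recalled from the i915 execlists code; the bit positions are an assumption and should be checked against the kernel headers, and the helper name is hypothetical.

```python
# Assumed bit positions of the context-status flags (GEN8_CTX_STATUS_*).
# These are recalled from the i915 driver, not taken from this bug report.
CSB_STATUS_BITS = {
    0: "IDLE_ACTIVE",
    1: "PREEMPTED",
    2: "ELEMENT_SWITCH",
    3: "ACTIVE_IDLE",
    4: "COMPLETE",
    15: "LITE_RESTORE",
}


def decode_csb_status(status):
    """Return the names of the known flag bits set in a CSB status dword,
    ordered from the lowest bit to the highest."""
    return [name for bit, name in sorted(CSB_STATUS_BITS.items())
            if status & (1 << bit)]


if __name__ == "__main__":
    # Status dwords (the first value of each pair) from the CSB dump in the
    # quoted commit message.
    for index, status in enumerate([0x18, 0x01, 0x18, 0x12, 0x8002, 0x14]):
        flags = " | ".join(decode_csb_status(status))
        print(f"CSB[{index}]: 0x{status:08x} -> {flags}")
```

With this mapping, 0x18 and 0x12 decode in a way that agrees with the "COMPLETE" and "PREEMPTED & COMPLETE" annotations in the log, which is some evidence that the assumed bit positions are right.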