Created attachment 129334 [details]
dmesg
The request list doesn't match hardware state -- please try with a later kernel, though a fix for a problem you may be hitting hasn't landed in upstream yet, so please try drm-tip.

Ok, I'm installing the 2017-02-09 build from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-tip/current/. I got some warnings on installation:

W: Possible missing firmware /lib/firmware/i915/kbl_dmc_ver1_01.bin for module i915
W: Possible missing firmware /lib/firmware/i915/glk_dmc_ver1_01.bin for module i915
W: Possible missing firmware /lib/firmware/i915/kbl_guc_ver9_14.bin for module i915
W: Possible missing firmware /lib/firmware/i915/bxt_guc_ver8_7.bin for module i915
W: Possible missing firmware /lib/firmware/i915/kbl_huc_ver02_00_1810.bin for module i915
W: Possible missing firmware /lib/firmware/i915/bxt_huc_ver01_07_1398.bin for module i915
W: Possible missing firmware /lib/firmware/i915/skl_huc_ver01_07_1398.bin for module i915

but I have no idea if they're relevant or are useful. The 4.10.0-994-generic kernel seems initially to work very well; I can't see any errors in dmesg about i915, unlike with Ubuntu's 4.4.0 kernels, and the TTYs actually display text, which they haven't for a while. If this issue is still present with this kernel, it's likely to be a few days/weeks before I run into it, so please don't close this bug yet.

Looks like I spoke too soon -- I just got a crash, much worse than previous ones. I was actively using the machine, not suspending/unsuspending it, and the GUI totally froze except the mouse. Switching to a TTY worked though, and I captured dmesg and a GPU crash dump; uploading those here.

Created attachment 129457 [details]
dmesg from 4.10.0-994 kernel
Created attachment 129458 [details]
GPU crash dump from 4.10.0-994 kernel
Baffling. Looks like the same issue, requests are being retired before their seqno is complete and objects reused before they are idle. That should not be possible! It's fair to say that progress at this point will mean you compiling your own kernel from https://cgit.freedesktop.org/drm-tip and then we start trying debug patches.

Ok, I can probably do that. Having never built the kernel before, where do you suggest I get the config from -- the Ubuntu 4.10 drm-tip build?

Yes, cp /boot/config-`uname -r` .config is a good starting point - it will then at least boot :)

I've built and booted a kernel from drm-tip (revision fb21519ea). Unfortunately, the initrd is huge (almost as big as /boot) which makes it a little hard to deal with. Will using INSTALL_MOD_STRIP=1 make debugging harder?

I think something is wrong with the mkinitramfs script then, I believe it should (or at least can?) only include the modules required for booting. Cross your fingers and try make localmodconfig. It only very rarely fails to include a module actually used for booting - usually because the delta between the base distro config and your own is too great. But in this case you have a v4.10 config so should be fine.

make localmodconfig worked fine and reduced the size of initrd to about a tenth of its original size, thanks for mentioning it! Seems like I now have a working kernel from source.

Cool, next step is wait for a hang and attach it. That's just to be sure we are still reproducing the same issue with a local build.

I got a hang, uploading logs.

Created attachment 129493 [details]
dmesg from drm-tip fb21519ea
Created attachment 129494 [details]
GPU crash dump from drm-tip fb21519ea
Hmm, still the same :| Apply this patch:

diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index 5a49487368ca..998e3780f2c6 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -26,7 +26,7 @@
 #define __I915_GEM_H__
 
 #ifdef CONFIG_DRM_I915_DEBUG_GEM
-#define GEM_BUG_ON(expr) BUG_ON(expr)
+#define GEM_BUG_ON(expr) WARN_ON(expr)
 #define GEM_WARN_ON(expr) WARN_ON(expr)
 #define GEM_DEBUG_DECL(var) var

and please recompile with CONFIG_DRM_I915_DEBUG_GEM. If you are using make menuconfig, look under Device Drivers / Graphics / i915 debugging options. You basically need to enable all options there, which also requires setting CONFIG_EXPERT under General Settings.

I got a crash, although I couldn't get the GPU crash dump unfortunately -- I couldn't even switch to a TTY. dmesg looks like it might be more useful now though. Should I be staying on revision fb21519ea for this, or should I be upgrading to the latest drm-tip?

Created attachment 129517 [details]
dmesg from fb21519ea with patch from #19
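
The effect of the patch from comment 19 can be illustrated outside the kernel. This is only a userspace sketch: warn_on() and warn_count are hypothetical stand-ins for the kernel's WARN_ON() machinery, not the kernel implementation. It shows the intent of the change, namely that a failed GEM assertion gets reported and execution carries on instead of halting.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical counter so the effect of a failed check can be observed. */
static int warn_count;

/* Userspace stand-in for WARN_ON(): report the failed condition, then continue. */
static int warn_on(int cond, const char *expr, const char *file, int line)
{
    assert(expr != NULL && file != NULL);
    if (cond) {
        warn_count++;
        fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
    }
    return cond;
}

/* After the comment 19 change, a failing GEM_BUG_ON warns rather than stopping. */
#define GEM_BUG_ON(expr) ((void)warn_on(!!(expr), #expr, __FILE__, __LINE__))
```

With this definition, GEM_BUG_ON(2 > 1) prints one warning and increments warn_count, and the code after it still runs. That is the property that lets dmesg record every assertion that trips before a hang.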
(In reply to Josh Holland from comment #20)
> I got a crash, although I couldn't get the GPU crash dump unfortunately -- I
> couldn't even switch to a TTY. dmesg looks like it might be more useful now
> though.

Hmm, after the hang though and the crash is consistent with the corruption. But it does make the matter much more serious as we go from a gpu hang to a driver lockup. I was expecting the debug code to detect something much, much earlier.

Hmm. Or maybe it is more significant than I first thought.

Feb 11 17:47:37 yes kernel: [ 2.358719] [drm] DRM_I915_DEBUG_GEM enabled

confirms that the debug code was indeed enabled.

> Should I be staying on revision fb21519ea for this, or should I be upgrading
> to the latest drm-tip?

Nothing pertinent to this bug yet, but refreshing every time we have an idea to test will be helpful (random bug fixes hopefully improving and not adding regressions elsewhere!).

commit fe3288b5da2c1286a7aac1fb1b2234caa752a81b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun Feb 12 17:20:01 2017 +0000

    drm/i915: Park the breadcrumbs signaler across a GPU reset

should fix the bug on hit in comment 19. Could you update drm-tip (git stash; git reset --hard <drm-tip>; git stash apply) and see what pops out of the woodwork this time? *

* still hoping for a nice WARN or a sensible error-state!

Ok, now building from 58294e406 with the patch from comment 19.

And another hang. This one I got dmesg and GPU dump from; it seemed more like the hangs I was getting on the Ubuntu 4.4 kernels (where everything apart from the cursor freezes for a minute, but then starts working fine again), rather than the 4.10 ones I've been building, where the entire UI (including cursor) freezes and doesn't seem to recover. Presumably your commit fixed something...

Created attachment 129567 [details]
dmesg from 58294e406 with patch from #19
Created attachment 129568 [details]
GPU crash dump from 58294e406 with patch from #19
(In reply to Josh Holland from comment #25)
> And another hang. This one I got dmesg and GPU dump from; it seemed more
> like the hangs I was getting on the Ubuntu 4.4 kernels (where everything
> apart from the cursor freezes for a minute, but then starts working fine
> again), rather than the 4.10 ones I've been building, where the entire UI
> (including cursor) freezes and doesn't seem to recover. Presumably your
> commit fixed something...

Yup, we are right back to the original pattern of hangs. But now I know it passed internal sanity checks before doing so. Puzzling. I need to think about how this could even arise, in the meantime could you please try running with i915.semaphores=0 on the kernel command line and see if that makes a difference?

Booted with i915.semaphores=0 (still on drm-tip 58294e406 with patch from #19), then got a 4.10-like hang where even Magic SysRq stopped working. I can upload the whole dmesg if you want, but I assume this is the relevant part, ending with the final message (the previous message was half an hour ago when bamfdaemon died, which it does occasionally):

Feb 14 16:51:03 yes kernel: [ 3435.777321] [drm] GPU HANG: ecode 6:0:0xbde7ffff, in compiz [3786], reason: Hang on render ring, action: reset
Feb 14 16:51:03 yes kernel: [ 3435.777323] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
-- ...how to report GPU hangs... --
Feb 14 16:51:03 yes kernel: [ 3435.777324] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Feb 14 16:51:03 yes kernel: [ 3435.777375] drm/i915: Resetting chip after gpu hang
Feb 14 16:51:06 yes kernel: [ 3438.780562] asynchronous wait on fence i915:[global]:3b94e timed out
Feb 14 16:51:11 yes kernel: [ 3443.772358] drm/i915: Resetting chip after gpu hang
Feb 14 16:51:23 yes kernel: [ 3455.931636] [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:31:pipe A] hw_done timed out
Feb 14 16:51:24 yes kernel: [ 3456.955546] asynchronous wait on fence i915:[global]:3b954 timed out
Feb 14 16:51:34 yes kernel: [ 3466.171061] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:31:pipe A] hw_done timed out
Feb 14 16:51:35 yes kernel: [ 3467.195005] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:31:pipe A] hw_done timed out

Unfortunately no GPU dump, due to the aforementioned lack of response to even an alt-SysRq-B, let alone ctrl-alt-F1.

Can you try:

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 4ffa35faff49..df094699ba9d 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1217,10 +1217,6 @@ static int init_status_page(struct intel_engine_cs *engine)
                return PTR_ERR(obj);
        }
 
-       ret = i915_gem_object_set_cache_level(obj, I915_CACHE_LLC);
-       if (ret)
-               goto err;
-
        vma = i915_vma_instance(obj, &engine->i915->ggtt.base, NULL);
        if (IS_ERR(vma)) {
                ret = PTR_ERR(vma);
@@ -1244,7 +1240,7 @@ static int init_status_page(struct intel_engine_cs *engine)
        if (ret)
                goto err;
 
-       vaddr = i915_gem_object_pin_map(obj, I915_MAP_WB);
+       vaddr = i915_gem_object_pin_map(obj, I915_MAP_WC);
        if (IS_ERR(vaddr)) {
                ret = PTR_ERR(vaddr);
                goto err_unpin;

Ok, built drm-tip 1d7915e78 with patches from #19 and #30, then got several hangs in a row. Attaching the last dmesg and GPU dump; dmesg is longer, and the GPU dump is barely different (only timestamps changed between hangs). Is there anything useful I can be giving you besides dmesg and /sys/class/drm/card0/error?

Created attachment 130055 [details]
dmesg from 1d7915e78 with patches #19 and #30
Created attachment 130056 [details]
GPU dump from 1d7915e78 with patches #19 and #30
dmesg & error are just what I need. (If I need anything else, the goal is to add it to the error state.)

Onto the next theory, a few more asserts:

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 4ffa35f..5a7c140 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -44,6 +44,7 @@ static int __intel_ring_space(int head, int tail, int size)
        int space = head - tail;
        if (space <= 0)
                space += size;
+       GEM_BUG_ON(space <= I915_RING_FREE_SPACE);
        return space - I915_RING_FREE_SPACE;
 }
 
@@ -1682,6 +1683,8 @@ u32 *intel_ring_begin(struct drm_i915_gem_request *req, int num_dwords)
                wait_bytes = total_bytes;
        }
 
+       GEM_BUG_ON(ring->space > __intel_ring_space(ring->head & HEAD_ADDR,
+                                                   ring->tail, ring->size));
        if (wait_bytes > ring->space) {
                int ret = wait_for_space(req, wait_bytes);
                if (unlikely(ret))
@@ -1698,6 +1701,7 @@ u32 *intel_ring_begin(struct drm_i915_gem_request *req, int num_dwords)
                ring->space -= remain_actual;
        }
 
+       GEM_BUG_ON(bytes > ring->space);
        GEM_BUG_ON(ring->tail > ring->size - bytes);
        cs = ring->vaddr + ring->tail;
        ring->tail += bytes;

Built from ec496685b with patches #19, #30 and #34, and got a hang where Compiz exploded. Is something munging the tabs in your patches, by the way? I've had to apply the last two by hand.

Created attachment 130066 [details]
dmesg from ec496685b, patches 19, 30, 34
Created attachment 130067 [details]
GPU dump from ec496685b, patches 19, 30, 34
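
The ring-space arithmetic that the asserts in the comment 34 patch guard can be tried out in userspace. The function body below follows the __intel_ring_space() lines visible in that patch. The value of RING_FREE_SPACE is an assumption, since the thread never states what I915_RING_FREE_SPACE is, and the bounds checks are added here for illustration only.

```c
#include <assert.h>

/* Assumed reserve in bytes; the real I915_RING_FREE_SPACE is not given in this thread. */
#define RING_FREE_SPACE 64

/*
 * Bytes the driver may still write into the ring: the distance from tail
 * forward to head, minus a reserve so that tail never catches up with head.
 */
static int ring_space(int head, int tail, int size)
{
    assert(size > RING_FREE_SPACE);
    assert(head >= 0 && head < size);
    assert(tail >= 0 && tail < size);

    int space = head - tail;
    if (space <= 0)
        space += size; /* head == tail is treated as an empty ring */
    return space - RING_FREE_SPACE;
}
```

For a 4096-byte ring, ring_space(0, 0, 4096) gives 4032 and ring_space(1024, 128, 4096) gives 832. With head only 32 bytes ahead of tail, ring_space(32, 0, 4096) returns -32; that is the kind of state the first added GEM_BUG_ON (space <= I915_RING_FREE_SPACE) is meant to flag.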
(In reply to Josh Holland from comment #35)
> Is something munging the tabs in your patches, by the way? I've had to apply
> the last two by hand.

Lazily pasting from a terminal that likes to expand tabs into the clipboard.

The pattern is that the ring is running past the TAIL. Now I need to find an explanation that doesn't involve the hw playing games with us.

Please try (pardon the tabs):

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 4ffa35faff49..62e31a7438ac 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -782,10 +782,10 @@ static void i9xx_submit_request(struct drm_i915_gem_request *request)
 {
        struct drm_i915_private *dev_priv = request->i915;
 
-       i915_gem_request_submit(request);
-
        GEM_BUG_ON(!IS_ALIGNED(request->tail, 8));
        I915_WRITE_TAIL(request->engine, request->tail);
+
+       i915_gem_request_submit(request);
 }
 
 static void i9xx_emit_breadcrumb(struct drm_i915_gem_request *req, u32 *cs)

Reference to Chris' patch: https://patchwork.freedesktop.org/series/20757/

The patch is bogus unfortunately. Still trying to find an explanation.

*** Bug 100110 has been marked as a duplicate of this bug. ***

Just documenting my continual failure here. Having run with DEBUG_GEM enabled, you also showed that this assert doesn't fire:

void __i915_gem_request_submit(struct drm_i915_gem_request *request)
{
        GEM_BUG_ON(i915_seqno_passed(intel_engine_get_seqno(engine), seqno));
}

which is the scenario I was worrying about in comment 39. I still need to write a test to see if the tail write goes backwards.
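
The assert quoted above relies on a wraparound-safe sequence number comparison. As a minimal sketch, assuming 32-bit seqnos compared through a signed difference (the usual idiom for this kind of check), the logic looks like the following in userspace. The names seqno_passed() and check_submit() are illustrative and are not the driver's own functions.

```c
#include <assert.h>
#include <stdint.h>

/* True if seq1 is at or after seq2, tolerating 32-bit wraparound. */
static int seqno_passed(uint32_t seq1, uint32_t seq2)
{
    /* The unsigned subtraction wraps; reading it as signed gives the ordering. */
    return (int32_t)(seq1 - seq2) >= 0;
}

/*
 * Userspace analogue of the quoted assertion: at submission time the
 * hardware seqno must not already have reached the request's seqno.
 */
static void check_submit(uint32_t hw_seqno, uint32_t request_seqno)
{
    assert(!seqno_passed(hw_seqno, request_seqno));
}
```

seqno_passed(2, 0xfffffffe) is true because the counter has wrapped forward by four, while seqno_passed(0xfffffffe, 2) is false. Equal values count as passed, so check_submit(5, 5) would trip the assert: a request whose seqno the hardware has already reported would look complete before it was ever submitted.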
Test to see if we ever write requests out-of-order:

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index be908e2a52ea..da610ce176a9 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -784,6 +784,16 @@ static void i9xx_submit_request(struct drm_i915_gem_request *request)
 
        i915_gem_request_submit(request);
 
+       {
+               u32 head = I915_READ_HEAD(request->engine) & HEAD_ADDR;
+               u32 tail = I915_READ_TAIL(request->engine) & HEAD_ADDR;
+               int prev = __intel_ring_space(tail, head, request->ring->size);
+               int next = __intel_ring_space(request->tail, head, request->ring->size);
+               WARN(head != tail && next <= prev,
+                    "Bacwards we go: head=%x, tail=%x, next=%x\n",
+                    head, tail, request->tail);
+       }
+
        GEM_BUG_ON(!IS_ALIGNED(request->tail, 8));
        I915_WRITE_TAIL(request->engine, request->tail);
 }

Created attachment 130303 [details] [review]
Test to see if TAIL writes go backwards

Chris, thank you for working to stabilize snb. I appreciate it.

Built from d8839e27a with all patches except comment 39, got a hang.

Created attachment 130328 [details]
dmesg from d8839e27a
Created attachment 130329 [details]
GPU dump from d8839e27a
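
The idea behind the "TAIL writes go backwards" test can be tried in isolation. The sketch below is a simplification and not the posted patch: it measures the queued distance from HEAD to TAIL with plain modular arithmetic instead of calling __intel_ring_space(), and it assumes the ring size is a power of two. Both helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes queued between head and tail; size is assumed to be a power of two. */
static uint32_t ring_queued(uint32_t head, uint32_t tail, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (tail - head) & (size - 1);
}

/*
 * Returns 1 if moving TAIL from `tail` to `next` does not increase the
 * amount of work queued ahead of HEAD, i.e. the write went backwards.
 */
static int tail_goes_backwards(uint32_t head, uint32_t tail,
                               uint32_t next, uint32_t size)
{
    if (head == tail)
        return 0; /* idle ring: skipped, as in the posted test's head != tail condition */
    return ring_queued(head, next, size) <= ring_queued(head, tail, size);
}
```

In a 4096-byte ring with head=100 and tail=200, a new tail of 300 is fine, while a new tail of 150 is reported as backwards. A tail that wraps, such as head=4000, tail=4090, next=16, still counts as forward progress because the queued distance grows from 90 to 112 bytes.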
(In reply to Josh Holland from comment #49)
> Created attachment 130329 [details]
> GPU dump from d8839e27a

The HEAD is past the TAIL, but the seqno is old. The batch buffer address is consistent with the instructions in the ring at HEAD. The oddity this time is that the seqnos do not match up with the supposed execution through the ring.

Very minor, but

commit fe085f13c7901203445fd2ab26c0f499313b8258
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Mar 21 10:25:52 2017 +0000

    drm/i915: Remove intel_ring.last_retired_head

may help clarify the expected values in the error state.

Hmm, just seen some similar symptoms in bug 100484 where HEAD > TAIL and seqno stopped updating long before. The telltale is that the context was blank and the failure occurred just after it was reloaded.

(In reply to Chris Wilson from comment #52)
> Hmm, just seen some similar symptoms in bug 100484 where HEAD > TAIL and
> seqno stopped updating long before. The telltale is that the context was blank
> and the failure occurred just after it was reloaded.

Sadly that doesn't appear to be the case here, the contexts here seem to have content (so hopefully valid content!)

*** Bug 100454 has been marked as a duplicate of this bug. ***

Can you please test with https://patchwork.freedesktop.org/patch/154241/ and GEM debugging enabled? It's a very rare possibility that we may have placed the RING_TAIL on the same cacheline as RING_HEAD.

Will do. It looks like that patch is already in drm-tip fb550f864? I'm also assuming previous patches to drivers/gpu/drm/i915/intel_ringbuffer.c are no longer relevant, since they don't apply on top of current drm-tip.

Got a hang (drm-tip d6a919d39), this time not in Chrome. I had the following DRM-related kernel config (gathered from previous comments and posted here for the next time I accidentally delete .config), hopefully I didn't leave anything important turned off.
#
# drm/i915 Debugging
#
CONFIG_DRM_I915_WERROR=y
CONFIG_DRM_I915_DEBUG=y
CONFIG_DRM_I915_DEBUG_GEM=y
CONFIG_DRM_I915_SW_FENCE_DEBUG_OBJECTS=y
# CONFIG_DRM_I915_SW_FENCE_CHECK_DAG is not set
CONFIG_DRM_I915_SELFTEST=y
# CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS is not set
# CONFIG_DRM_I915_DEBUG_VBLANK_EVADE is not set

Created attachment 131468 [details]
dmesg from d6a919d39
Created attachment 131469 [details]
GPU dump from d6a919d39
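
For readers unfamiliar with the cacheline concern raised alongside patchwork patch 154241, the condition "RING_TAIL on the same cacheline as RING_HEAD" can be expressed as a small predicate. This is a sketch under stated assumptions: a 64-byte cacheline is assumed, the helper names are hypothetical, and the thread does not show what the linked patch actually checks.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed cacheline size in bytes; not stated anywhere in this thread. */
#define CACHELINE_BYTES 64u

/* Round a ring offset down to the start of its cacheline. */
static uint32_t cacheline(uint32_t offset)
{
    assert((CACHELINE_BYTES & (CACHELINE_BYTES - 1)) == 0);
    return offset & ~(CACHELINE_BYTES - 1);
}

/* 1 if the two ring offsets fall within the same cacheline. */
static int same_cacheline(uint32_t head, uint32_t tail)
{
    return cacheline(head) == cacheline(tail);
}
```

Offsets 0x100 and 0x13f share a cacheline under this assumption, while 0x13f and 0x140 do not, even though they are adjacent bytes.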
(In reply to Josh Holland from comment #57)
> Got a hang (drm-tip d6a919d39), this time not in Chrome. I had the following
> DRM-related kernel config (gathered from previous comments and posted here
> for the next time I accidentally delete .config), hopefully I didn't leave
> anything important turned off.

I think that's actually a genuine userspace hang (in mesa), checking the location of the RING_HEAD is consistent with our expectation and it doesn't seem to have the same stray ACTHD or wacky retirements as earlier. (Still slightly cautious as it took a while to spot the strange behaviour originally, and I may be missing it here.)

I've been running Mesa 17.1 (rather than Ubuntu Xenial's Mesa 12) for nearly three weeks now, and I'm inclined to agree -- I still have graphical issues with stuff flickering, especially in Chrome, and dmesg still has the occasional "Atomic update failure on pipe A", but I haven't had a single GPU hang on Mesa 17 AFAIR.

Scratch that, I switched back to Ubuntu's kernel (4.4.0-79) from drm-tip and I got a hang after a week. Uploading error state for completeness' sake.

Created attachment 132284 [details]
dmesg (Ubuntu kernel 4.4.0-79, Mesa 17.1.2)
Created attachment 132285 [details]
GPU dump (Ubuntu kernel 4.4.0-79, Mesa 17.1.2)
*** Bug 102120 has been marked as a duplicate of this bug. ***

*** Bug 103407 has been marked as a duplicate of this bug. ***

Hello,

As indicated above, my bug report 103407 (also a GPU hang, on the latest openSUSE Tumbleweed with Lenovo X201 and X220) was marked as a duplicate of this bug. Reading above, I see that the issue is still open, and I'd like to ask here whether there is anything we can do about it?

TIA

(In reply to dev66 from comment #67)
> Reading above, I see that the issue is still open, and I'd like to ask here
> whether there is anything we can do about it?
>
> TIA

Unfortunately it seems not; I am also still getting this occasionally. It now only seems to happen to me when the machine is under memory pressure or at high CPU temperatures, so I wonder if it could be a hardware problem.

First of all, sorry about the spam. This is a mass update for our bugs. Sorry if you find this annoying, but we are trying to understand whether this bug is still valid. If the bug investigation is still in progress, please ignore this, and I apologize! If you think this bug is no longer valid, please comment on the bug that it can be closed. If you haven't tested with our latest pre-upstream tree (drm-tip), can you also do that to see whether the issue is still present there? If you cannot see the issue there, please comment on the bug.

Closing, please re-open if the issue still exists.
Created attachment 129333 [details]
GPU crash dump

Randomly when waking from suspend, I get graphical issues, with this in dmesg:

[36834.014792] [drm] stuck on render ring
[36834.015296] [drm] GPU HANG: ecode 6:0:0xbd69ffff, in compiz [2609], reason: Ring hung, action: reset
[36834.015299] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[36834.015300] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[36834.015301] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[36834.015302] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[36834.015303] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[36834.017384] drm/i915: Resetting chip after gpu hang
[36840.026991] [drm] stuck on render ring
[36840.027509] [drm] GPU HANG: ecode 6:0:0xfeffffff, in compiz [2609], reason: Ring hung, action: reset
[36840.029611] drm/i915: Resetting chip after gpu hang

The graphical glitches vary; mostly, all my windows just get moved to one workspace. This time, large amounts of text in some applications disappeared -- very little text is visible in gnome-system-monitor and nautilus (even when closed and reopened), but Chrome and gnome-terminal aren't affected.

I'm running Ubuntu 16.04, Linux yes 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux, on a Lenovo G580 laptop.