Created attachment 69799 [details]
dmesg of the crash

When my laptop is under heavy load (compile, rsync, ...) the screen goes black, and dmesg fills with messages of the form:

[ 528.932020] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 528.932025] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 529.028586] ------------[ cut here ]------------
[ 529.028598] WARNING: at drivers/gpu/drm/i915/intel_display.c:1049 intel_enable_pipe+0x160/0x1b0()
[ 529.028600] Hardware name: HP Compaq 6910p (GB950EA#UUG)
[ 529.028601] PLL state assertion failure (expected on, current off)
[ 529.028603] Modules linked in: tun i2c_i801 acpi_cpufreq mperf hid_generic usbhid hid arc4 snd_hda_codec_analog snd_hda_intel snd_hda_codec 8250_pci kvm_intel iwl4965 snd_pcm iwlegacy snd_page_alloc mac80211 snd_timer hp_accel 8250_core lis3lv02d e1000e cfg80211 kvm snd hp_wmi serial_core battery psmouse sr_mod cdrom input_polldev sparse_keymap uhci_hcd ac wmi
[ 529.028632] Pid: 2605, comm: upowerd Not tainted 3.7.0-rc2-00008-g0390c88 #1
[ 529.028634] Call Trace:
[ 529.028640] [<c10306d8>] ? warn_slowpath_common+0x78/0xb0
[ 529.028643] [<c125c1e0>] ? intel_enable_pipe+0x160/0x1b0
[ 529.028645] [<c125c1e0>] ? intel_enable_pipe+0x160/0x1b0
[ 529.028648] [<c10307a3>] ? warn_slowpath_fmt+0x33/0x40
[ 529.028650] [<c125c1e0>] ? intel_enable_pipe+0x160/0x1b0
[ 529.028653] [<c125ebf0>] ? i9xx_crtc_mode_set+0xc40/0x1240
[ 529.028656] [<c12635b5>] ? intel_set_mode+0x525/0x870
[ 529.028661] [<c1264820>] ? intel_get_load_detect_pipe+0x2b0/0x3a0
[ 529.028665] [<c12f1d08>] ? bit_xfer+0x178/0x4c0
[ 529.028670] [<c10eebf4>] ? __d_instantiate_unique+0xe4/0x130
[ 529.028673] [<c10d36a5>] ? kmem_cache_alloc+0x55/0xa0
[ 529.028677] [<c112d0e9>] ? sysfs_open_file+0x179/0x240
[ 529.028680] [<c10a9613>] ? prep_new_page+0x113/0x1d0
[ 529.028685] [<c127ca20>] ? intel_tv_detect+0x80/0x3f0
[ 529.028688] [<c10a9843>] ? get_page_from_freelist+0x173/0x3e0
[ 529.028693] [<c1230e65>] ? status_show+0x35/0x80
[ 529.028696] [<c1230e30>] ? dpms_show+0x50/0x50
[ 529.028700] [<c128a298>] ? dev_attr_show+0x18/0x50
[ 529.028702] [<c112d237>] ? sysfs_read_file+0x87/0x140
[ 529.028705] [<c10dae55>] ? do_sys_open+0x165/0x1c0
[ 529.028708] [<c112d1b0>] ? sysfs_open_file+0x240/0x240
[ 529.028710] [<c10dba5b>] ? vfs_read+0x8b/0x130
[ 529.028713] [<c10dbb4a>] ? sys_read+0x4a/0x90
[ 529.028717] [<c13c55fa>] ? sysenter_do_call+0x12/0x22
[ 529.028719] ---[ end trace 1562ac833b2f8043 ]---
[ 529.440024] [drm:i915_reset] *ERROR* Failed to reset chip.
[ 529.445880] ------------[ cut here ]------------
[ 529.445887] WARNING: at drivers/gpu/drm/i915/intel_display.c:1049 intel_enable_pipe+0x160/0x1b0()
[ 529.445889] Hardware name: HP Compaq 6910p (GB950EA#UUG)
[ 529.445890] PLL state assertion failure (expected on, current off)
[ 529.445892] Modules linked in: tun i2c_i801 acpi_cpufreq mperf hid_generic usbhid hid arc4 snd_hda_codec_analog snd_hda_intel snd_hda_codec 8250_pci kvm_intel iwl4965 snd_pcm iwlegacy snd_page_alloc mac80211 snd_timer hp_accel 8250_core lis3lv02d e1000e cfg80211 kvm snd hp_wmi serial_core battery psmouse sr_mod cdrom input_polldev sparse_keymap uhci_hcd ac wmi
[ 529.445919] Pid: 2605, comm: upowerd Tainted: G W 3.7.0-rc2-00008-g0390c88 #1
[ 529.445920] Call Trace:

Nothing can resurrect the screen (switching to a console, killing X), but the machine still responds over ssh, so I was able to collect dmesg, i915_error_state, meminfo, mtrr, slabinfo, swaps, vmallocinfo, vmstat and zoneinfo, if any of them is needed. I attach the dmesg and the i915_error_state.

The last kernel I used that shows the same problem is 3.7-rc4.
Created attachment 69800 [details] i915_error_state after the crash
The hang is reminiscent of bug 55984.
How quickly can you reproduce this? If you can hit this easily, can you please attempt a bisect to root-cause the commit that introduced the problem for you?
Can you also please give us your exact mesa version?
Also: Do you have any swap partition enabled?
I can reproduce it easily, so I will attempt a bisect. I have a 2G swap enabled. My versions are (I use Gentoo with their x11 overlay):

mesa: snb-magic-12553-gb534c39
drm: libdrm-2.4.39-16-g14db948
xf86-video-intel: 2.20.9-43-gfb5205a (with uxa, no sna)

If you want me to try other versions, no problem.
Ok, two things for you to test please:

- Can you please test Chris' fastboot branch from http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=fastboot Despite its name, it also contains some trickery with memory barriers which might help here.

- Our QA discovered a random corruption issue (bug #56859) and bisected it to

commit 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
Author: Jianguo Wu <wujianguo@huawei.com>
Date: Mon Oct 8 16:33:06 2012 -0700

    mm: fix-up zone present pages

Can you please test whether reverting that commit changes anything?
I tried with that commit reverted, but the problem is still here. I'm still bisecting. I did it once, but I don't think I did it correctly, because it gave me a merge commit as the first bad one:

commit 9db908806b85c1430150fbafe269a7b21b07d15d
Merge: 4d7127d 72f36d5
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date: Sat Oct 13 13:22:01 2012 -0700

    Merge tag 'md-3.7' of git://neil.brown.name/md

So I am restarting it, but first I'll test the fastboot branch and report.
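A note on why a bisect of an intermittent hang can land on an unrelated commit. The sketch below is not how git bisect is implemented (real history is a graph, not a list); it is a minimal binary search over a hypothetical linear list of commits, assuming the first commit is good, the last is bad, and a bad commit stays bad. The names `first_bad`, `reliable` and `flaky` are made up for illustration. It shows that a single false "good" verdict, for example because the hang did not appear within the test window, is enough to shift the result.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Binary search for the earliest bad commit.

    Assumes commits[0] is good, commits[-1] is bad, and that every
    commit after the first bad one is also bad (linear history).
    """
    lo, hi = 0, len(commits) - 1  # lo is known good, hi is known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


# Hypothetical history: the regression really enters at c3.
commits = [f"c{i}" for i in range(8)]
truly_bad = set(commits[3:])


def reliable(commit: str) -> bool:
    # Every bad commit is observed to hang.
    return commit in truly_bad


def flaky(commit: str) -> bool:
    # Same as above, except the hang is missed once on c3.
    return commit in truly_bad and commit != "c3"


print(first_bad(commits, reliable))
print(first_bad(commits, flaky))
```

With a hang that sometimes takes a long time to show up, each "good" verdict is only as trustworthy as the test time behind it, so it is worth testing apparently good steps for longer than the bad ones.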
Just to be sure that this is what I have to test, I did:

git clone http://cgit.freedesktop.org/~ickle/linux-2.6/ -b fastboot fastboot

and I have:

v2.6.32-rc1-168511-g7da6bfc

Is that OK?
(In reply to comment #9)
> just to be sure that this is what I have to test; I did:
>
> git clone http://cgit.freedesktop.org/~ickle/linux-2.6/ -b fastboot fastboot
>
> and have :
>
> v2.6.32-rc1-168511-g7da6bfc
>
> Is that OK ?

That's

commit 7da6bfcd589270bcd35bfcf0b029403c52e5ad06
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Nov 13 11:43:29 2012 +0000

    drm/i915: Only preserve the BIOS modes if they are the preferred ones

which is the tip of fastboot, so it should be fine. You are just lacking a few tags. :)
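For anyone puzzled by strings like v2.6.32-rc1-168511-g7da6bfc: git describe prints the nearest tag it can reach, the number of commits on top of that tag, and "g" followed by the abbreviated commit hash. When newer tags are missing from the clone, the tag part is old and the count is very large, which is presumably what "lacking a few tags" refers to. The helper below is only an illustrative sketch; `parse_describe` and `Describe` are hypothetical names, not part of git.

```python
import re
from typing import NamedTuple, Optional


class Describe(NamedTuple):
    tag: str                 # nearest reachable tag
    commits_since_tag: int   # commits on top of that tag
    abbrev: Optional[str]    # abbreviated commit hash, None if exactly on a tag


# <tag>-<count>-g<abbreviated hash>; the tag itself may contain dashes.
_PATTERN = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<abbrev>[0-9a-f]+)$")


def parse_describe(text: str) -> Describe:
    """Split `git describe` output into its three parts."""
    text = text.strip()
    match = _PATTERN.match(text)
    if match is None:
        # No suffix: the commit is exactly the tag.
        return Describe(text, 0, None)
    return Describe(match["tag"], int(match["count"]), match["abbrev"])


result = parse_describe("v2.6.32-rc1-168511-g7da6bfc")
print(result)

# The abbreviated hash is a prefix of the full commit id quoted above.
full_hash = "7da6bfcd589270bcd35bfcf0b029403c52e5ad06"
print(full_hash.startswith(result.abbrev))
```

The hash suffix is the part to compare when checking which commit is actually running; the tag and count depend on which tags the local clone happens to have.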
I tested the fastboot branch and it seems stable. Usually it crashes after several minutes, but so far it has not. Do you still want me to finish my bisect or try something else?
And of course, just after pressing the send button, it crashed :-S So the fastboot branch has the problem too. Sorry for answering too quickly.
Created attachment 70113 [details] [review]
disable unbound tracking

Silly me just noticed that the unbound tracking has been merged into 3.7, not 3.6. This has a big enough impact to explain all kinds of things. Please try the attached patch, thanks.
I tested it on a 3.7-rc4 (?) kernel, without success.
Created attachment 70169 [details] [review]
disable cpu relocs completely

I'm not completely sure, but I think we haven't ruled this one out yet. Please test, thanks.
For reference, please attach a full dmesg, thanks.
Ping for dmesg - we have similar reports spanning a few different platforms, and we're trying to hunt down common patterns. Kernel version really doesn't matter.
Sorry for the late answer. I tested the patch with the same result. I attach the dmesg from this boot.
Created attachment 70247 [details] dmesg from crashed 3.7-rc4 with "disable cpu relocs completely" patch
Ok, yet another new theory ... please attach your kernel .config, thanks.
Created attachment 70268 [details] config file for the 3.7 kernel
Can you please try the tree from http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug55984 and see if that improves matters?
Well, after 3 hours of uptime it has not crashed yet, so it is more stable. For me, it's the first good 3.7 kernel. Thanks!
(In reply to comment #23)
> Well, after a 3 hours uptime, it didn't yet crash. So it is more stable.
> So for me, it's the first 3.7 good kernel. Thanks !

So that I know which of the many branches I labeled as bug55984 today you are on, can you please tell me which commit you are running? Thanks.
> git describe
v2.6.32-rc1-157061-g966339d
> uname -r
3.6.0-rc7-157061-g966339d

I hope it's really a 3.7 ;-)
Oh noes, wrong branch... Sorry. Interestingly though, that is my master branch from just before the merge with 3.7-rc2, so it still has all of the contentious features.

However, to focus on the present, presuming you did something like:

$ git remote add ickle -f git://people.freedesktop.org/~ickle/linux-2.6

You want to do a

$ git checkout -b bug56916 ickle/bug59844

build, install, test.
Just to be sure:

> git remote add ickle -f git://people.freedesktop.org/~ickle/linux-2.6
Updating ickle
remote: Counting objects: 22446, done.
remote: Compressing objects: 100% (7007/7007), done.
remote: Total 21006 (delta 16562), reused 18081 (delta 13994)
Receiving objects: 100% (21006/21006), 3.77 MiB | 175 KiB/s, done.
Resolving deltas: 100% (16562/16562), completed with 653 local objects.
From git://people.freedesktop.org/~ickle/linux-2.6
* [new branch] 2.6.38 -> ickle/2.6.38
* [new branch] 845g -> ickle/845g
* [new branch] 8xx-cache-coherency -> ickle/8xx-cache-coherency
* [new branch] amalgam -> ickle/amalgam
* [new branch] async -> ickle/async
* [new branch] broken-vm -> ickle/broken-vm
* [new branch] bug48652 -> ickle/bug48652
* [new branch] bug55984 -> ickle/bug55984
* [new branch] derrmr -> ickle/derrmr
* [new branch] direct-gtt -> ickle/direct-gtt
* [new branch] drm-intel-fixes -> ickle/drm-intel-fixes
* [new branch] drm-intel-next -> ickle/drm-intel-next
* [new branch] drm-intel-testing -> ickle/drm-intel-testing
* [new branch] fastboot -> ickle/fastboot
* [new branch] fence-pin -> ickle/fence-pin
* [new branch] for-airlied -> ickle/for-airlied
* [new branch] for-danvet -> ickle/for-danvet
* [new branch] for-imre -> ickle/for-imre
* [new branch] for-jiri -> ickle/for-jiri
* [new branch] gen2-pageflip -> ickle/gen2-pageflip
* [new branch] gen3-pageflip -> ickle/gen3-pageflip
* [new branch] gtt -> ickle/gtt
* [new branch] intel-next -> ickle/intel-next
* [new branch] irq-poll -> ickle/irq-poll
* [new branch] ivb-vsync -> ickle/ivb-vsync
* [new branch] master -> ickle/master
* [new branch] next -> ickle/next
* [new branch] old-queue -> ickle/old-queue
* [new branch] panel-refactor -> ickle/panel-refactor
* [new branch] pinleak -> ickle/pinleak
* [new branch] ppgtt -> ickle/ppgtt
* [new branch] reap-mmap-offsets -> ickle/reap-mmap-offsets
* [new branch] remove-pipelining -> ickle/remove-pipelining
* [new branch] ring-freq -> ickle/ring-freq
* [new branch] scatterlist -> ickle/scatterlist
* [new branch] set-cache-level -> ickle/set-cache-level
* [new branch] snb -> ickle/snb
* [new branch] stolen -> ickle/stolen
* [new branch] stutter -> ickle/stutter
* [new branch] total-gtt -> ickle/total-gtt
* [new branch] unbound -> ickle/unbound
* [new branch] unbound-cache -> ickle/unbound-cache
* [new branch] upstream -> ickle/upstream
* [new branch] vm -> ickle/vm
* [new branch] vmap -> ickle/vmap
* [new branch] wait-seqno -> ickle/wait-seqno
* [new branch] xv-overlay -> ickle/xv-overlay
* [new branch] xv-pinleak -> ickle/xv-pinleak
> git describe
v3.7-rc4
> git checkout -b bug56916 ickle/bug59844
fatal: Cannot update paths and switch to branch 'bug56916' at the same time.
Did you intend to checkout 'ickle/bug59844' which can not be resolved as commit?
> git checkout -b bug56916
M drivers/gpu/drm/i915/i915_gem_execbuffer.c
Switched to a new branch 'bug56916'
> git describe
v3.7-rc4
> git branch
* bug56916
master
radeon

Is it ok? (Yes, I'm a newcomer to "more advanced git usage".) If so, I'll build, install, test ;-)
Gah, I meant bug55984. So,

$ git checkout bug56916
$ git reset --hard ickle/bug55984
> git checkout bug56916
M drivers/gpu/drm/i915/i915_gem_execbuffer.c
Already on 'bug56916'
> git reset --hard ickle/bug55984
HEAD is now at 889b020 drm/i915: Avoid forcing relocations through the mappable GTT or CPU
> git describe
v3.7-rc5-209-g889b020

Hope it's ok now, sorry if I'm "slow" :-)
ok I tested it and the problem is still there.
(In reply to comment #30)
> ok I tested it and the problem is still there.

That matches the results I found yesterday as well. So far, ickle/for-imre is the only 3.7 branch that is surviving.
Created attachment 71808 [details] [review]
make the shrinker less aggressive

Duct-tape solution if it is one, but imo very much worth a try.
I will try it asap. I redid several bisects, and they pointed to:

bf7ad8eeab995710c766df49c9c69a8592ca0216 is the first bad commit
commit bf7ad8eeab995710c766df49c9c69a8592ca0216
Author: Michel Lespinasse <walken@google.com>
Date: Mon Oct 8 16:30:37 2012 -0700

    rbtree: move some implementation details from rbtree.h to rbtree.c

    rbtree users must use the documented APIs to manipulate the tree structure. Low-level helpers to manipulate node colors and parenthood are not part of that API, so move them to lib/rbtree.c

It seems not to be the culprit, but rather to expose the bug more. The only problem I can see (I am not a developer) is that it changes

-static inline void rb_set_parent(struct rb_node *rb, struct rb_node *p)
-{
-	rb->rb_parent_color = (rb->rb_parent_color & 3) | (unsigned long)p;
-}

to:

+#define rb_color(r) ((r)->__rb_parent_color & 1)
...
+static inline void rb_set_parent(struct rb_node *rb, struct rb_node *p)
+{
+	rb->__rb_parent_color = rb_color(rb) | (unsigned long)p;
+}

so it changes the "& 3" to "& 1". I tried applying that change to a working kernel but got no crash, and reverting it from 3.7 didn't give a stable kernel either.
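To make the "& 3" versus "& 1" difference concrete, here is a small model of the two quoted rb_set_parent variants using plain integers in place of pointers. It is only a sketch under stated assumptions: the color is taken to live in bit 0 of the packed word, as the quoted rb_color() macro suggests, and parent addresses are assumed to be at least 4-byte aligned so their two low bits are zero. The function names are hypothetical, not kernel API.

```python
def set_parent_old(parent_color: int, parent: int) -> int:
    """Old variant: keep the two low bits, replace the pointer part."""
    return (parent_color & 3) | parent


def set_parent_new(parent_color: int, parent: int) -> int:
    """New variant: keep only the color bit (bit 0), replace the pointer part."""
    return (parent_color & 1) | parent


def variants_agree(parent_color: int, parent: int) -> bool:
    """True when both variants produce the same packed word."""
    return set_parent_old(parent_color, parent) == set_parent_new(parent_color, parent)


old_parent = 0x1000  # made-up, 4-byte aligned addresses
new_parent = 0x2000

# Bit 1 clear (color 0 or 1): both variants give the same result.
print(variants_agree(old_parent | 0, new_parent))
print(variants_agree(old_parent | 1, new_parent))

# Bit 1 set: the old variant carries it over, the new one drops it.
print(hex(set_parent_old(old_parent | 2, new_parent)))
print(hex(set_parent_new(old_parent | 2, new_parent)))
```

In this model the two variants only diverge if something has set bit 1 of the packed word. If nothing does, the mask change is behaviour-neutral, which would be consistent with the observation that applying or reverting it made no difference to stability.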
Hm, that's a very strange bisect - at most this should affect code generation a bit and move a few functions around in the compiled kernel. But we already know that this 3.7 regression is most likely a side-effect of some seemingly unrelated change, which then brings a probably pre-existing bug to the surface.
so far, so good. I tested the 3.7 kernel with this patch and I can still see my screen, so for me this patch may have a Tested-by from my side :-) Thanks
Created attachment 71909 [details] [review]
Overallocate fenced regions

So, that patch just has the effect of changing the eviction order so that cached bo are no longer preferentially thrown out. All pointing towards a latent bug elsewhere.

The error-states in https://bugzilla.redhat.com/show_bug.cgi?id=877461 follow the same pattern as I've observed with invalid surface sizes (an EU is idle waiting for the never-returning sampler, whilst all other EU are busy stalling for the shared resource). So based on that observation, let's attack surface allocation. Please try the attached patch for the DDX (UXA).
Also available for testing: https://patchwork.kernel.org/patch/1896161/ If the suggestion is that memory layout and eviction play a critical role, the above at least is one genuine bug that we can fix.
Created attachment 71931 [details] [review]
Align surface sizes to an even tile row

A slightly more refined patch.
(In reply to comment #36)
> Created attachment 71909 [details] [review]
> Overallocate fenced regions

Do you want me to test it with a crashing 3.7 kernel and a 2.20.16-48-g52fd223 + patch intel driver? And for the 2 other patches, which one should I test now? Both together, or only the last one?
(In reply to comment #39)
> (In reply to comment #36)
> > Created attachment 71909 [details] [review]
> > Overallocate fenced regions
>
> Do you want me to test it with a crashing 3.7 kernel and a
> 2.20.16-48-g52fd223 + patch intel driver ?
> And for the 2 other patches, which one should I test now ? both together,
> the last one only ?

So far, we have a positive report for combining the xf86-video-intel patch and the kernel eviction fix, that is, both of the patches from comments 37 and 38. So please try that combination first.
I successfully tested the intel driver (2.20.16-48-g52fd223) with the patch on a 3.7 kernel without the kernel patch, and that seems to be enough for me. I will patch the kernel and retest.
xf86-video-intel

commit 736b89504a32239a0c7dfb5961c1b8292dd744bd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun Dec 30 10:32:18 2012 +0000

    uxa: Align surface allocations to even tile rows

    Align surface sizes to an even number of tile rows to cater for sampler prefetch. If we read beyond the last page we may catch the PTE in a state of flux and trigger a GPU hang. Also detected by enabling invalid PTE access checking.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=56916
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55984
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
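The rounding described in that commit message can be sketched in a few lines. This is not the code from the xf86-video-intel commit; it is a minimal illustration, and both the function name `align_to_even_tile_rows` and the default of 8 scanlines per tile row (what X-tiling is assumed to use here) are assumptions made for the example.

```python
def align_to_even_tile_rows(height: int, tile_height: int = 8) -> int:
    """Round a surface height up to an even number of tile rows.

    height      -- surface height in scanlines
    tile_height -- scanlines per tile row (assumed value, see lead-in)
    """
    if height < 0 or tile_height <= 0:
        raise ValueError("height must be >= 0 and tile_height must be > 0")
    tile_rows = -(-height // tile_height)  # ceiling division
    if tile_rows % 2:
        tile_rows += 1                     # pad odd counts up to the next even count
    return tile_rows * tile_height


for h in (1, 8, 9, 16, 17):
    print(h, align_to_even_tile_rows(h))
```

The padding means a sampler that prefetches into the following tile row still lands inside the allocation rather than on whatever page happens to come next, which is the situation the commit message describes as catching a PTE "in a state of flux".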
Grr, this bug is driving me crazy. I just hit it again today by visiting the web page referenced in http://thread.gmane.org/gmane.comp.video.dri.devel/78328

So I wanted to patch a 3.7 kernel with your patch from #37 but got this compile error:

CC drivers/gpu/drm/drm_hashtab.o
CC drivers/gpu/drm/drm_mm.o
drivers/gpu/drm/drm_mm.c: In function ‘drm_mm_scan_remove_block’:
drivers/gpu/drm/drm_mm.c:612:3: error: implicit declaration of function ‘__drm_mm_hole_node_end’ [-Werror=implicit-function-declaration]
cc1: some warnings being treated as errors
make[3]: *** [drivers/gpu/drm/drm_mm.o] Error 1
make[2]: *** [drivers/gpu/drm] Error 2
make[1]: *** [drivers/gpu] Error 2
make: *** [drivers] Error 2
Just change the __drm to drm: https://bugs.freedesktop.org/attachment.cgi?id=72022
Just to let you know that today I had the problem again :-( Is there a way I can help you ?
(In reply to comment #45)
> Just to let you know that today I had the problem again :-(
> Is there a way I can help you ?

Was that with any of the patches discussed applied?
(In reply to comment #46)
> (In reply to comment #45)
> > Just to let you know that today I had the problem again :-(
> > Is there a way I can help you ?
>
> Was that with any of the patches discussed applied?

Yes, both the kernel and the intel driver patches were applied.
(In reply to comment #47)
> (In reply to comment #46)
> > (In reply to comment #45)
> > > Just to let you know that today I had the problem again :-(
> > > Is there a way I can help you ?
> >
> > Was that with any of the patches discussed applied?
>
> Yes, both the kernel and the intel driver patches were applied.

Which one of the two kernel patches? "make the shrinker less aggressive" and/or "drm: Only evict the blocks required to create the requested hole"?
(In reply to comment #48)
> (In reply to comment #47)
> > (In reply to comment #46)
> > > (In reply to comment #45)
> > > > Just to let you know that today I had the problem again :-(
> > > > Is there a way I can help you ?
> > >
> > > Was that with any of the patches discussed applied?
> >
> > Yes, both the kernel and the intel driver patches were applied.
>
> Which one of the two kernel patches? "make the shrinker less aggressive"
> and/or "drm: Only evict the blocks required to create the requested hole"?

The one from Patchwork, "drm: Only evict the blocks required to create the requested hole".
(In reply to comment #49)
> (In reply to comment #48)
> > Which one of the two kernel patches? "make the shrinker less aggressive"
> > and/or "drm: Only evict the blocks required to create the requested hole"?
>
> Patchwork drm: Only evict the blocks required to create the requested hole

Can you please test the "make shrinker less aggressive" too? Maybe on top of all the current patches.
(In reply to comment #50)
...
> Can you please test the "make shrinker less aggressive" too? Maybe on top of
> all the current patches.

Sure, will report if any problem.
Everyone please retest with latest drm-intel-fixes from http://cgit.freedesktop.org/~danvet/drm-intel I've just merged a bunch of duct-tapes for this issue.
Consolidating all gen4/5 i/o related hangs.

*** This bug has been marked as a duplicate of bug 55984 ***
A patch referencing this bug report has been merged in Linux v3.8-rc4:

commit 93927ca52a55c23e0a6a305e7e9082e8411ac9fa
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Thu Jan 10 18:03:00 2013 +0100

    drm/i915: Revert shrinker changes from "Track unbound pages"
Patch merged, closing.