Created attachment 68564 [details]
error state

Okay, I've seen both my ilk machines GPU hang over the weekend, and I've never seen them do it before. I've got an error state from one at least; if the cause isn't in there then I suspect rc6.
The hang looks pretty clean (no suspicious operations), more or less upon the transition from a 3D to BLT within a UXA batch buffer; rc6 requiring w/a would not surprise me.
Ok, I've hunted around in our docs a bit and found a few ilk w/as we don't implement. Or at least what I think we miss, given our sorry state of docs. Pushed out to http://cgit.freedesktop.org/~danvet/drm/log/?h=ilk-wa-pile
Also if you want to pin the blame on rc6, i915.i915_enable_rc6=0...
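To confirm after a reboot that the rc6 setting above actually took effect, the value can be read back from sysfs. The following is only a sketch: it assumes the parameter is exposed under /sys/module/i915/parameters (the usual location for module parameters, possibly readable by root only), and the helper name i915_param is made up for illustration.

```shell
#!/bin/sh
# Print the current value of an i915 module parameter, or "unknown" if
# the sysfs file is missing or unreadable (it may require root).
# Usage: i915_param NAME [PARAM_DIR]
# PARAM_DIR defaults to /sys/module/i915/parameters (an assumption about
# where the kernel exposes it); it is overridable so the helper can be
# exercised on a machine without the hardware.
i915_param() {
    name=$1
    dir=${2:-/sys/module/i915/parameters}
    if [ -r "$dir/$name" ]; then
        cat "$dir/$name"
    else
        echo unknown
    fi
}

# Example: check whether i915.i915_enable_rc6=0 was picked up.
echo "i915_enable_rc6 = $(i915_param i915_enable_rc6)"
```

A value of 0 would mean rc6 is off; "unknown" only means the file could not be read, not that rc6 is enabled.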
okay got another death with rc6 disabled like Norbert. took about 3-4 days this time.
Created attachment 69415 [details]
another error state

[133200.848120] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[133200.848128] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[133202.367409] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[133202.367692] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[133202.367695] [drm:i915_reset] *ERROR* Failed to reset chip.

bits from dmesg. this is 3.6.0 + -next + ilks wa, I'll try and start a bisect on it now, 4-5 days a hang, back in a few years
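The "GPU hanging too fast, declaring wedged!" message in the dmesg above follows a second hangcheck report that arrives shortly after the first. As a rough aid for comparing logs, here is a sketch that prints the gap between consecutive hangcheck reports. The function name hang_gaps is hypothetical, and the exact threshold the driver uses is not stated in this thread (I believe it was a few seconds in kernels of this era, but treat that as an assumption).

```shell
#!/bin/sh
# Read dmesg-style text on stdin and print the gap in seconds between
# consecutive "GPU hung" hangcheck reports, one gap per line.
hang_gaps() {
    awk '
        /i915_hangcheck_hung/ && /GPU hung/ {
            # The timestamp is the first bracketed number on the line.
            if (match($0, /\[ *[0-9]+\.[0-9]+\]/)) {
                ts = substr($0, RSTART + 1, RLENGTH - 2) + 0
                if (seen) printf "%.3f\n", ts - prev
                prev = ts
                seen = 1
            }
        }
    '
}

# Typical use on a live system:
#   dmesg | hang_gaps
```

For the two hangcheck lines quoted above (133200.848120 and 133202.367409) the gap is about 1.5 seconds.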
That error-state is more consistent with a relocation failure than Norbert's - it fails trying to execute a composite operation within the middle of a batch.
Created attachment 69489 [details]
i915_error_state.txt.gz

Having seen https://lkml.org/lkml/2012/10/23/155 I think I am affected by the same bug. While I was compiling a kernel in a tmpfs, all of a sudden KWin died. When I looked in dmesg, I saw:

[95597.708097] pci 0000:01:00.0: power state changed by ACPI to D3cold
[98683.176729] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[98683.176736] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[98683.184252] [drm:init_ring_common] *ERROR* failed to set render ring head to zero ctl 00000000 head 69c191cc tail 00000000 start 00003000
[98683.240710] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 69c191cc tail 00000000 start 00003000
[98686.163041] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[98686.163202] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[98686.163205] [drm:i915_reset] *ERROR* Failed to reset chip.

Attached is an i915_error_state from today, running 3.7-rc2-492-ge657e07 (only some ARM patches before 3.7-rc3). I remember that I had exactly the same error message in a -testing branch on 3.6 (http://cgit.freedesktop.org/~danvet/drm-intel/tag/?h=drm-intel-testing&id=drm-intel-next-2012-09-20). I built that kernel on Sep 21 and it locked up on Sep 27 (no rebooting, just suspends). If you want a dmesg (nothing interesting) or logs/i915_error_state from that 3.6 kernel, let me know.
# lspci -vv -s 00:02.0
00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated Graphics Controller (rev 02) (prog-if 00 [VGA controller])
	Subsystem: CLEVO/KAPOK Computer Device 7130
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 47
	Region 0: Memory at fd000000 (64-bit, non-prefetchable) [size=4M]
	Region 2: Memory at d0000000 (64-bit, prefetchable) [size=256M]
	Region 4: I/O ports at 1800 [size=8]
	Expansion ROM at <unassigned> [disabled]
	Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee0f00c Data: 4142
	Capabilities: [d0] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [a4] PCI Advanced Features
		AFCap: TP+ FLR+
		AFCtrl: FLR-
		AFStatus: TP-
	Kernel driver in use: i915
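In the init_ring_common errors from this report, the render ring head reads 69c191cc both before and after the attempted re-initialization, i.e. it never returned to zero. To make such comparisons easier across several dmesg captures, the register dump can be pulled out with sed. This is only a sketch; ring_state is a hypothetical helper name, and it assumes the message layout shown in this thread (ctl, head, tail, start in that order).

```shell
#!/bin/sh
# Read dmesg-style text on stdin and, for every init_ring_common error,
# print the ring registers as "ctl=... head=... tail=... start=...".
ring_state() {
    sed -n 's/.*init_ring_common\].* ctl \([0-9a-f]*\) head \([0-9a-f]*\) tail \([0-9a-f]*\) start \([0-9a-f]*\).*/ctl=\1 head=\2 tail=\3 start=\4/p'
}

# Typical use on a live system:
#   dmesg | ring_state
```

If head stays the same non-zero value on both lines, the ring reset itself did not take, which matches the "Failed to reset chip" message that follows.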
Created attachment 69490 [details] Xorg.0.log In Xorg, I only changed to use SNA instead of the default (UXA?).
In case it gets lost, I bisected the hang to:

504c7267a1e84b157cbd7e9c1b805e1bc0c2c846 is the first bad commit
commit 504c7267a1e84b157cbd7e9c1b805e1bc0c2c846
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 23 13:12:52 2012 +0100

    drm/i915: Use cpu relocations if the object is in the GTT but not mappable

    This prevents the case of unbinding the object in order to process the
    relocations through the GTT and then rebinding it only to then proceed
    to use cpu relocations as the object is now in the CPU write domain. By
    choosing to use cpu relocations up front, we can therefore avoid the
    rebind penalty.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

:040000 040000 090ed3d52b4f3210b988877f747b6ff86e123385 1d48be89ded4777a543b693db833de64877059c4 M	drivers
Ok, doesn't look like an rc6 thing, but very much like a regression.
Reverting that commit on top of 3.7-rc4 did not fix the hang issue. If you need a guinea pig for testing, here I am.
Created attachment 69604 [details] [review]
disable unmappable

Since right now we still have tons of signs pointing at unmappable gtt handling being broken/non-coherent somehow, let's try this sledgehammer here and simply disable it all.
I applied that "sledgehammer" patch on 3.7-rc4, but the error persists. I saved dmesg and the i915_error_state file. If you need more information (or those logs), please let me know.
Can you please attach the new error_state with the sledgehammer? Maybe things shifted around enough to see what's going on ...
Created attachment 69615 [details] i915_error_state.gz with sledgehammer patch
(In reply to comment #15)
> Created attachment 69615 [details]
> i915_error_state.gz with sledgehammer patch

Note that this hang is slightly different again, closer to the one reported by Norbert, in that HEAD didn't advance into the batchbuffer, as opposed to a hang within or after the batch. So can you please try the hack in conjunction with the ilk-wa-pile?
Created attachment 69630 [details] i915_error_state.txt.gz on ilk-wa-pile 6ef21d3 + sledgehammer The issue still exists. Same errors in dmesg.
Ok, next interesting observation is that your error states both have a double emission of the request seqno, so perhaps submitting that many PIPE_CONTROL in sequence is triggering an error? Can you please test, on top of everything else,

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 3af1f2f..994d752 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -662,10 +662,11 @@ pc_render_add_request(struct intel_ring_buffer *ring,
 	 * incoherence by flushing the 6 PIPE_NOTIFY buffers out to
 	 * memory before requesting an interrupt.
 	 */
-	ret = intel_ring_begin(ring, 32);
+	ret = intel_ring_begin(ring, 34);
 	if (ret)
 		return ret;
 
+	intel_ring_emit(ring, MI_FLUSH);
 	intel_ring_emit(ring, GFX_OP_PIPE_CONTROL(4) | PIPE_CONTROL_QW_WRITE |
 			PIPE_CONTROL_WRITE_FLUSH |
 			PIPE_CONTROL_TEXTURE_CACHE_INVALIDATE);
@@ -691,6 +692,7 @@ pc_render_add_request(struct intel_ring_buffer *ring,
 	intel_ring_emit(ring, pc->gtt_offset | PIPE_CONTROL_GLOBAL_GTT);
 	intel_ring_emit(ring, seqno);
 	intel_ring_emit(ring, 0);
+	intel_ring_emit(ring, MI_FLUSH);
 	intel_ring_advance(ring);
 
 	*result = seqno;
Created attachment 69639 [details]
i915_error_state.txt.gz ilk-wa-pile 6ef21d3 + sledgehammer + ring flush

The bug is still triggered.
Created attachment 69656 [details]
ilk-wa-pile + sledgehammer + ring flush

Same here, I got a hang with all the mentioned patches while compiling a big chunk of TeX Live. The error state is attached.
ARGH! Still it hangs in the middle of a series of requests (with no intervening batches or other operations). That should be impossible design wise, and improbable hardware wise.
Created attachment 69781 [details]
i915 error state with ilk-wa-pile + sledgehammer + ring flush (another hang)

Here is another hang with a different error state (at least to my eyes). It happened when running git checkout on a big repository. No other messages.
Is there anything to test? I mentioned before that this occurs when the memory is almost full. I have no swap, but 8GB RAM. Copied five times 1.2GiB (=6GiB total) to tmpfs (/dev/shm and /tmp).
8GB machine (i3-330m) with no swap:

$ mount -ttmpfs -osize=100% none /tmp/wtf
$ while :; do yes wtf > /tmp/wtf/wtf; done &
$ sudo X -ac -noreset & while :; do x11perf -aa10text -d :0; done

With that I am able to repeatedly drive the machine to oom without triggering a GPU hang. Note I am using this set of patches on top of dinq: http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=fastboot

Peter, is that close enough to your test case to trigger the bug, or do I need to tweak it slightly? Can you please also test with the patches in fastboot, in case there is an accidental fix?
Created attachment 69833 [details]
i915_error_state.txt.gz ickle/linux-2.6 fastboot with sledgehammer + ring flush

The bug still triggers, w/ and w/o the sledgehammer+ring flush patches. The dmesg is now slightly different on the ickle/linux-2.6 fastboot branch:

[ 501.214949] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 501.214958] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 501.219393] [drm:init_ring_common] *ERROR* failed to set render ring head to zero ctl 00000000 head 09e16d8c tail 00000000 start 00300000
[ 501.262795] [drm:intel_dp_aux_wait_done] *ERROR* dp aux hw did not signal timeout (has irq: 1)!
[ 501.274784] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 09e16d8c tail 00000000 start 00300000
[ 501.302762] [drm:intel_dp_aux_wait_done] *ERROR* dp aux hw did not signal timeout (has irq: 1)!
[ 502.274145] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 502.274293] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[ 502.274298] [drm:i915_reset] *ERROR* Failed to reset chip.

One thing that I now notice (because I did not try it before) is that switching to a text console (ctrl+alt+F1) gives me a black screen with some flashing large rectangles on screen (possibly the flashing cursor for the username).

Starting just X from a TTY and then running x11perf/glxgears/glxspheres with cp linux tree / yes / dd if=/dev/zero of=/tmp/wtf.. did not work. Even if OOM killed half my machine I can only reproduce it after logging into KDE and running a GL program (like glxgears). x11perf does not trigger the bug, even in KDE. Maybe other (compositing) window managers work too, but I have not tested that.

I am using the below bash script after logging into KDE. After starting this script, I watch the kernel log (journalctl -f) and run `glxgears`.
#!/bin/bash
mkdir -p /tmp/wtf
mountpoint /tmp/wtf||sudo mount -osize=6200M -t tmpfs none /tmp/wtf
echo 15 > /proc/$$/oom_score_adj # just in case...
pids=
for i in /tmp/wtf/hang-{1..6}; do
    rm -rf "$i"
    #yes wtf > $i & # did not work
    cp -ra ~/Linux-src/linux "$i" &
    pids="$pids $!"
done
trap "kill $pids" EXIT
wait
Peter, have you tested i915.i915_enable_rc6=0 (on top of the sledgehammer and w/a)? You have a most peculiar failure pattern where the GPU should be idle and then dies in a flush.
A similar bug on i965gm (bug #56916) mentions that things _only_ blow up when a mesa program is running. So everyone who can hit this, please reply with your exact mesa version and what (if any) GL programs you have running when this happens (GL compositor, ...). Also, those who can readily reproduce the hangs, please check whether stopping all GL clients (disable the compositor or use a non-GL one) prevents the hangs.
New results:
- with rc6 disabled, neither 3.7-rc4 nor wa+sledgehammer+ringflush exposes the bug
- with rc6 not disabled (i.e. the default, -1), wa+sledgehammer+ringflush and GL compositing disabled in KWin, the bug is not triggered (in the same boot, GL compositing was enabled again and the bug showed up)

I am using the standard Mesa packages shipped with Arch Linux, that is 9.0. The bug is triggered when KDE's KWin is active and glxgears is running. (Instead of glxgears, I first tried glxspheres, which triggers the bug too.)
Definitely looks like we have a pair of independent unresolved "cpu-relocs" and rc6 issues.
Ok, it looks like we have different bugs here, or at least non-overlapping sets of workarounds :( Peter Wu, can you please check what happens when you manually enable rc6 on a 3.6 kernel? Norbert Preining, test-results for your machine wrt rc6 vs. "mesa client/compositor running" vs. 3.6/3.7-rc would be really interesting, since iirc you can blow up your machine rather quickly, too.
3.6.6 w/o patches, w/ i915.i915_enable_rc6=1, w/ OpenGL compositing WM (KWin) and glxgears does *not* trigger the bug. I do get a very sluggish desktop which ultimately leads to some OOMs, but that is normal. If it helps, I have tested the stock arch kernel config: https://projects.archlinux.org/svntogit/packages.git/tree/trunk/config.x86_64?h=packages/linux&id=89de8dc7df6894c219e746326ca338e9279c2e3f and my own config: https://github.com/Lekensteyn/aur/blob/13feda6a55fb67c912c0611dc0c019bb084e7560/linux-custom/config
(In reply to comment #30)
> Norbert Preining, test-results for your machine wrt rc6 vs. "mesa
> client/compositor running" vs. 3.6/3.7-rc would be really interesting, since
> iirc you can blow up your machine rather quickly, too.

I am running now with rc6 disabled and all the patches mentioned above. I am trying with Gnome3 and some GLX programs to see what I can do.

Norbert
Created attachment 69900 [details] i915 error state, i915_enable_rc6=0, rc4 + ilk-wa-pipe + sledgehammer + ring flush As requested, here is another hang with rc6 disabled and the above patches. Happened again when doing heavy photo viewing with quick switching in shotwell. If you need other configurations or tests, please let me know Norbert
(In reply to comment #33)
> Created attachment 69900 [details]
> i915 error state, i915_enable_rc6=0, rc4 + ilk-wa-pipe + sledgehammer + ring
> flush

Same hang as before on your machine between a rectlist PRIM and a BLT.

> As requested, here is another hang with rc6 disabled and the above patches.
> Happened again when doing heavy photo viewing with quick switching in
> shotwell.

To check: Is this with a GL client/compositor running?

> If you need other configurations or tests, please let me know

If the above is with a GL client, then trying to hang the box without any GL client/compositor running would be interesting.
(In reply to comment #34)
> Same hang as before on your machine between a rectlist PRIM and a BLT.

Ok, at least repeatable ;-) So in my case rc6 does not make a difference; the former one was without any specific rc6 cmdline.

> To check: Is this with a GL client/compositor running?

Gnome3, so I guess there is a compositor running.

> If the above is with a GL client, then trying to hang the box without any GL
> client/compositor running would be interesting.

Hmm, what WM could I use? I guess I have to go back to fvwm again. Will try one way or the other.

Norbert
Norbert, since you see a slightly different presentation of this bug, it would be useful if you could also test http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=fastboot which despite its name also contains some work on the mb() around the relocations.
(In reply to comment #36)
> http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=fastboot which despite
> its name also contains some work on the mb() around the relocations.

Ok, there is one merge conflict with current kernel master, but I am trying to build the kernel now after fixing the conflict in one way (keeping the code). I tried also to merge that with the ilk-pile but that was hopeless with loads of merge conflicts. Will give feedback as soon as I can.

Norbert
Our QA discovered a random corruption issue (bug #56859) and bisected it to

commit 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
Author: Jianguo Wu <wujianguo@huawei.com>
Date:   Mon Oct 8 16:33:06 2012 -0700

    mm: fix-up zone present pages

Can those who can reproduce this bug here easily please test whether reverting that commit changes anything?
Reverting that commit on top of 3.7-rc5-git-14-g9924a19 does not help.
(just restoring priority so it doesn't fall out of our p1 lists)
So one thing worth trying is:

diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
index 7ad6d13..6177daa 100644
--- a/drivers/char/agp/intel-gtt.c
+++ b/drivers/char/agp/intel-gtt.c
@@ -573,7 +573,7 @@ static int intel_gtt_init(void)
 		return ret;
 
 	intel_private.base.gtt_mappable_entries = intel_gtt_mappable_entries();
-	intel_private.base.gtt_total_entries = intel_gtt_total_entries();
+	intel_private.base.gtt_total_entries = intel_gtt_mappable_entries();
 
 	/* save the PGETBL reg for resume */
 	intel_private.PGETBL_save =

(It's a bit shotgun, but if it still continues to fail after that all the earlier symptoms have just been canaries.)
(In reply to comment #41)
> So one thing worth trying is:
>
> diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
> index 7ad6d13..6177daa 100644
> --- a/drivers/char/agp/intel-gtt.c
> +++ b/drivers/char/agp/intel-gtt.c
> @@ -573,7 +573,7 @@ static int intel_gtt_init(void)
> return ret;
>
> intel_private.base.gtt_mappable_entries =
> intel_gtt_mappable_entries();
> - intel_private.base.gtt_total_entries = intel_gtt_total_entries();
> + intel_private.base.gtt_total_entries = intel_gtt_mappable_entries();
>
> /* save the PGETBL reg for resume */
> intel_private.PGETBL_save =
>
> (It's a bit shotgun, but if it still continues to fail after that all the
> earlier symptoms have just been canaries.)

Looks eerily similar to attachment #69604 [details] [review], i.e. it has been tried and doesn't work on at least Peter's machine.
Created attachment 70111 [details] [review] disable unbound tracking Silly me just noticed that the unbound tracking has been merged into 3.7, not 3.6. This has a big enough impact to explain all kinds of things. Please try the attached patch, thanks.
(In reply to comment #43)
> Created attachment 70111 [details] [review] [review]
> disable unbound tracking
>
> Silly me just noticed that the unbound tracking has been merged into 3.7,
> not 3.6. This has a big enough impact to explain all kinds of things. Please
> try the attached patch, thanks.

i.e. i915_gem_object_set_to_cpu_domain(obj, true); on unbind which would more explicitly test the failure mechanism.
Created attachment 70114 [details] [review]
always do set-gtt-domain

As a follow-on test, this covers one of the areas where we short-circuit domain tracking, which may be fouled up by not calling set-to-cpu-domain upon unbind.
Created attachment 70142 [details] [review] always do set-gtt-domain Better patch, maybe a fix for something...
(In reply to comment #46)
> Created attachment 70142 [details] [review] [review]
> always do set-gtt-domain
>
> Better patch, maybe a fix for something...

On top of what should we try that? rc5 plain? rc5+ilk-pile? ...?
Only this patch or some others from this thread, too?

Thanks

Norbert
(In reply to comment #47)
> (In reply to comment #46)
> > Created attachment 70142 [details] [review] [review] [review]
> > always do set-gtt-domain
> >
> > Better patch, maybe a fix for something...
>
> On top of what should we try that? rc5 plain? rc5+ilk-pile? ...?
> Only this patch or some others from this thread, too?

Plain 3.7-rc kernel, just pick one that's broken ;-) Please also test the patch in comment #43 since that one tests a different theory.
(In reply to comment #48)
> Plain 3.7-rc kernel, just pick one that's broken ;-) Please also test the
> patch in comment #43 since that one tests a different theory.

Ok, compiling now.

Thanks
(In reply to comment #48)
> Plain 3.7-rc kernel, just pick one that's broken ;-) Please also test the
> patch in comment #43 since that one tests a different theory.

#44 / #46 are both elements of #43... All 3 are worth testing independently.
Still broken with the same dmesg messages:
- 3.7.0-rc5-git-68-gc5e35d6 + disable unbound tracking
- 3.7.0-rc5-git-68-gc5e35d6 + always-do-set-to-gtt

Do you want me to add the i915_error_states?
Peter Wu, since you seem to have dug out the only bisect result (which didn't check out when reverting), can you please check whether the parent of the bad commit is working out for you perfectly well? Afaict this should be

commit 0327d6ba998ca181013a5a1709701a6532a41972
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Aug 11 15:41:06 2012 +0100

    drm/i915: Extract general object init routine

Pretty much all hairy changes in gem for 3.7 are before that commit, so knowing that things are solid with this sha1 would be rather helpful. So please beat on this extensively, thanks.
Created attachment 70168 [details] [review] disable cpu relocs completely I'm not completely sure, but I think we haven't ruled this one out yet. Please test, thanks.
Still affected:
- 3.6.0-rc2-git-87-g0327d6b no patches
- 3.7.0-rc5-git-68-gc5e35d6 + disable-cpu-relocs

Do I need to combine some patches? E.g. the disable-cpu-relocs with sledgehammer + ring flush?
To hunt down a few other theories, can everyone please attach the complete dmesg (doesn't really matter whether with drm.debug or not, kernel version also doesn't matter).
Created attachment 70188 [details] dmesg/error_state on ilk/drm-intel-nightly/ubuntu 12.10
Created attachment 70192 [details] dmesg from 3.6.0-rc2-git-87-g0327d6b no patches
Created attachment 70214 [details]
dmesg from 3.7.0-rc5+

Here is my dmesg from the currently running kernel. I just returned from a trip and will try the patch from comment 53 on top of the patches in comments 43 and 46.

Norbert
Created attachment 70248 [details] [review] use dma32 for gem bo allocations It's not very likely, but on the off chance that this helps, please test.
Ok, yet another new theory ... everyone please attach your kernel .config, thanks.
Created attachment 70271 [details] kernel config used for 3.7.x I haven't tested the patch from comment 59 yet, but here is my kernel config.
Created attachment 70279 [details] config-3.7.0-rc4-g6283022
Created attachment 70286 [details] Norbert's kernel config
Created attachment 70296 [details]
bug hit with patches 3.7-rc6 plus patches from #43 and #46, but not #59

Here is another hang when running the two patches 43 and 46. Immediately after that I got also a page alloc failure, here the syslog messages:

Nov 20 14:32:57 tofuschnitzel kernel: [55009.700562] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Nov 20 14:32:57 tofuschnitzel kernel: [55009.700571] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Nov 20 14:32:58 tofuschnitzel kernel: [55011.204741] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Nov 20 14:32:58 tofuschnitzel kernel: [55011.204853] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
Nov 20 14:32:58 tofuschnitzel kernel: [55011.204858] [drm:i915_reset] *ERROR* Failed to reset chip.
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390112] cat: page allocation failure: order:9, mode:0x2000d0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390158] Pid: 17244, comm: cat Not tainted 3.7.0-rc6+ #42
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390188] Call Trace:
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390214] [<ffffffff81095abc>] warn_alloc_failed+0x10a/0x11e
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390247] [<ffffffff81096f44>] ? page_alloc_cpu_notify+0x3e/0x3e
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390280] [<ffffffff81096f55>] ? drain_local_pages+0x11/0x13
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390313] [<ffffffff81097de2>] __alloc_pages_nodemask+0x5a0/0x5e2
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390351] [<ffffffff810beaf0>] ____cache_alloc+0x2b5/0x544
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390382] [<ffffffff810beddd>] __kmalloc+0x5e/0x96
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390413] [<ffffffff810ddeba>] seq_read+0x1c3/0x324
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390443] [<ffffffff810c4d6a>] vfs_read+0x98/0xfa
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390470] [<ffffffff810c4e19>] sys_read+0x4d/0x7a
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390500] [<ffffffff814bb9d2>] system_call_fastpath+0x16/0x1b
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390532] Mem-Info:
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390547] DMA per-cpu:
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390565] CPU 0: hi: 0, btch: 1 usd: 0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390591] CPU 1: hi: 0, btch: 1 usd: 0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390617] CPU 2: hi: 0, btch: 1 usd: 0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390643] CPU 3: hi: 0, btch: 1 usd: 0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390669] DMA32 per-cpu:
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390686] CPU 0: hi: 186, btch: 31 usd: 0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390712] CPU 1: hi: 186, btch: 31 usd: 185
Nov 20 14:33:58 tofuschnitzel kernel: [55071.391920] CPU 2: hi: 186, btch: 31 usd: 0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.393113] CPU 3: hi: 186, btch: 31 usd: 0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.394320] Normal per-cpu:
Nov 20 14:33:58 tofuschnitzel kernel: [55071.395487] CPU 0: hi: 186, btch: 31 usd: 0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.396654] CPU 1: hi: 186, btch: 31 usd: 152
Nov 20 14:33:58 tofuschnitzel kernel: [55071.397839] CPU 2: hi: 186, btch: 31 usd: 28
Nov 20 14:33:58 tofuschnitzel kernel: [55071.398968] CPU 3: hi: 186, btch: 31 usd: 0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.400072] active_anon:115042 inactive_anon:65045 isolated_anon:0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.400072] active_file:264259 inactive_file:405769 isolated_file:0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.400072] unevictable:22 dirty:7 writeback:0 unstable:0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.400072] free:58783 slab_reclaimable:45728 slab_unreclaimable:11528
Nov 20 14:33:58 tofuschnitzel kernel: [55071.400072] mapped:15992 shmem:10670 pagetables:6769 bounce:0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.400072] free_cma:0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.406629] DMA free:15748kB min:540kB low:672kB high:808kB active_anon:0kB inactive_anon:4kB active_file:120kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:20kB slab_unreclaimable:4kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Nov 20 14:33:58 tofuschnitzel kernel: [55071.410109] lowmem_reserve[]: 0 2925 3808 3808
Nov 20 14:33:58 tofuschnitzel kernel: [55071.411302] DMA32 free:172664kB min:103388kB low:129232kB high:155080kB active_anon:357940kB inactive_anon:99880kB active_file:879092kB inactive_file:1375920kB unevictable:16kB isolated(anon):0kB isolated(file):0kB present:2995364kB mlocked:16kB dirty:12kB writeback:0kB mapped:46008kB shmem:12988kB slab_reclaimable:140052kB slab_unreclaimable:11464kB kernel_stack:584kB pagetables:5324kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:18 all_unreclaimable? no
Nov 20 14:33:59 tofuschnitzel kernel: [55071.415174] lowmem_reserve[]: 0 0 883 883
Nov 20 14:33:59 tofuschnitzel kernel: [55071.416495] Normal free:46720kB min:31236kB low:39044kB high:46852kB active_anon:102228kB inactive_anon:160296kB active_file:177824kB inactive_file:247156kB unevictable:72kB isolated(anon):0kB isolated(file):0kB present:904960kB mlocked:72kB dirty:16kB writeback:0kB mapped:17960kB shmem:29692kB slab_reclaimable:42840kB slab_unreclaimable:34644kB kernel_stack:2592kB pagetables:21752kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Nov 20 14:33:59 tofuschnitzel kernel: [55071.420657] lowmem_reserve[]: 0 0 0 0
Nov 20 14:33:59 tofuschnitzel kernel: [55071.422091] DMA: 3*4kB 3*8kB 0*16kB 1*32kB 3*64kB 3*128kB 3*256kB 2*512kB 3*1024kB 3*2048kB 1*4096kB = 15748kB
Nov 20 14:33:59 tofuschnitzel kernel: [55071.423560] DMA32: 2116*4kB 2105*8kB 5306*16kB 1052*32kB 192*64kB 39*128kB 13*256kB 10*512kB 1*1024kB 1*2048kB 0*4096kB = 172664kB
Nov 20 14:33:59 tofuschnitzel kernel: [55071.425055] Normal: 2764*4kB 1152*8kB 951*16kB 237*32kB 27*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 48768kB
Nov 20 14:33:59 tofuschnitzel kernel: [55071.426553] 695040 total pagecache pages
Nov 20 14:33:59 tofuschnitzel kernel: [55071.428009] 14351 pages in swap cache
Nov 20 14:33:59 tofuschnitzel kernel: [55071.429508] Swap cache stats: add 236320, delete 221969, find 63452/74380
Nov 20 14:33:59 tofuschnitzel kernel: [55071.431003] Free swap = 9658608kB
Nov 20 14:33:59 tofuschnitzel kernel: [55071.432418] Total swap = 9905148kB
Nov 20 14:33:59 tofuschnitzel kernel: [55071.446640] 1015792 pages RAM
Nov 20 14:33:59 tofuschnitzel kernel: [55071.448108] 37280 pages reserved
Nov 20 14:33:59 tofuschnitzel kernel: [55071.449587] 1201194 pages shared
Nov 20 14:33:59 tofuschnitzel kernel: [55071.451024] 804246 pages non-shared
Nov 20 14:33:59 tofuschnitzel kernel: [55071.452459] SLAB: Unable to allocate memory on node 0 (gfp=0xd0)
Nov 20 14:33:59 tofuschnitzel kernel: [55071.454003] cache: size-2097152, object size: 2097152, order: 9
Nov 20 14:33:59 tofuschnitzel kernel: [55071.455474] node 0: slabs: 0/0, objs: 0/0, free: 0

Maybe that helps. Now I am running #43, #46, #59.
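A note on reading the page allocation failure in that log: "order:9" means the allocator was asked for 2^9 physically contiguous pages. Assuming the usual 4 KiB page size on x86-64 (an assumption, the log does not state it), that is 2 MiB, which agrees with the "cache: size-2097152 ... order: 9" line from SLAB. The arithmetic as a small sketch, with a hypothetical helper name:

```shell
#!/bin/sh
# Convert a page allocation order to a size in bytes.
# Usage: order_to_bytes ORDER [PAGE_SIZE]
# PAGE_SIZE defaults to 4096, the common x86-64 page size (assumed here).
order_to_bytes() {
    order=$1
    page_size=${2:-4096}
    # An order-N allocation is 2^N contiguous pages.
    echo $(( page_size << order ))
}

order_to_bytes 9
```

The call trace shows the request coming from a cat process via seq_read and __kmalloc, so it is a large contiguous buffer for a file read rather than an allocation by the GPU itself; with memory this fragmented, a failure at that size is not surprising.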
Another hang with #43, #46, #59 patches. Is the i915 error state needed? It always happens while I am doing heavy IO things. This time pbuilder/cowbuilder installation tests of 1Gb of new packages on Debian. Norbert
One question we haven't asked is whether this is a genuine hang or an unfortunate hangcheck. Can you please reproduce with i915.enable_hangcheck=0 and see if your machine locks up instead of reporting the hang?
For me bisection pointed to commit #6c085a72 - drm/i915: Track unbound pages. A couple of additional test runs at this and its parent commit confirm this. I'll try the patches from comments #43-#46. Since those didn't fix the problem for Norbert, it might be that we have multiple issues.

Chris: running with enable_hangcheck=0 on #6c085a72 the machine locked up.
Imre suggests that there is a possible fix in http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=for-imre. Can people please try that branch and see if it improves matters for them?
Hi Chris,

I am now running with http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=for-imre pulled into main linux git.

The first thing I noticed is that there are some very strange effects happening: I have docky (a panel) running, and set to auto-hide. If it is shown, then there are boxes around *some* of the icons there. And if the docky panel is going into hiding mode, then a nice green bar appears across my screen.

Is this a known problem? Should I report it somewhere? I have made two screenshots showing the effects, should I upload them here or somewhere else?

Now I will do some testing with these patches.

Norbert
(In reply to comment #69)
> Hi Chris,
>
> I am now running with
> http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=for-imre pulled into
> main linux git.
>
> The first thing I realized that there are some very strange effects
> happening: I have docky (a panel) running, and set to auto-hide. If it is
> shown, then there are boxes around *some* of the icons there. And if the
> docky panel is going into hiding mode, then a nice green bar appears across
> my screen.
>
> Is this a known problem? Should I report it somewhere? I have made two
> screenshots showing the effects, should I upload them here or somewhere else?

No, that's a little unexpected...But for now focus on the question whether the original hang is reproducible and I'll build a second tree with just the likely fixes.
I've put a smaller selection of patches in http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug55984. It's still a shotgun approach, but a good first step will be to see if it cures the hang..
Created attachment 70428 [details] Script to trigger the bug. This seems to be quickest way to repro the bug on Ville's ILK
(In reply to comment #72) > Script to trigger the bug. > > This seems to be quickest way to repro the bug on Ville's ILK I couldn't trigger anything with that; it just happily continued for ages. On my machine heavy IO with both reads and writes seems to be necessary, while the script only reads.
Running current linux HEAD with http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug55984 pulled in, I cannot trigger the crash, however hard I have tried for some time now. Compared with the for-imre branch, the strange artifacts are also gone, so it looks much better now. I still have the following boot cmd line: i915.i915_enable_rc6=0 i915.enable_hangcheck=0 I will do more testing, of course. Norbert
Okay, I guess I have to retract my statement; I didn't realize it at first. Due to i915.enable_hangcheck=0 it seems that, instead of just the 3D and the Gnome3 WM dying, the screen went black and didn't react to anything. Interestingly there were no messages at all in the log files. I could SysRq the computer, and the logfiles showed activity, but the screen remained black. I guess that is the sign of a hang. Pity.
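For anyone repeating this test, it can help to confirm that both options really made it onto the kernel command line before drawing conclusions. A minimal sketch, assuming a POSIX shell; the helper name `cmdline_has` is made up for illustration, and only the two option strings come from this report:

```shell
#!/bin/sh
# cmdline_has CMDLINE OPTION
# Succeeds if OPTION appears as a whole space-separated word in CMDLINE.
# (Hypothetical helper, not part of any i915 tooling.)
cmdline_has() {
    case " $1 " in
        *" $2 "*) return 0 ;;
    esac
    return 1
}

# Check the running kernel when /proc/cmdline is readable.
if [ -r /proc/cmdline ]; then
    running=$(cat /proc/cmdline)
    for opt in i915.i915_enable_rc6=0 i915.enable_hangcheck=0; do
        if cmdline_has "$running" "$opt"; then
            echo "$opt: set"
        else
            echo "$opt: NOT set"
        fi
    done
fi
```

Matching whole words avoids a false positive on something like `i915.i915_enable_rc6=01`.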
Ok, after banging my head against this for several days, I have decreed that the death within the render ring (inside the sequence of FLUSH PIPE_CONTROLx8 FLUSH) is due to enabling of rc6 on ILK. That doesn't explain all the crashes, but it does explain the "immediate" crashes on danvet-ilk using Daniel's killscript. The remaining crashes, where the GPU vanishes in mid-batch, are what we need to try and reproduce - and, I hope, are the common factor between this bug and #57122, #56916, #57136.
With http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug55984, the situation does not change, i.e. still the lockup message and vanishing 3D capabilities. (This bug is still marked NEEDINFO; do you need more details?)
The last bit of information we need is how to reproduce the non-rc6 related hangs - all the killscripts we've generated so far seem to hit the rc6 issue, afaik. (The highest priority is used for internal bug tracking.)
Created attachment 70577 [details] [review] Don't force GTT/CPU relocations Today's patch, please disable rc6 whilst testing.
Chris, please let us know on top of what? On top of the bug55984 git branch you created earlier, or is that one not necessary? Thanks
(In reply to comment #80) > Chris, please let us know on top of what? On top of the bug55984 git branch > you created earlier, or is that one not necessary? In isolation, so on top of 3.7-rc7 or drm-intel-fixes. The goal is to both understand the issue and develop a minimal patch in time for 3.7. So yesterday...
No news... Has the instadeath gone, but the slow lingering death remains?
Hi Chris, > No news... Has the instadeath gone, but the slow lingering death remains? Well, I am running the latest patch on top of git and try to trigger the bug, till now without success, though. I cannot say more or less, at least the frequency has reduced. Let me know if I can help more than just trying to trigger it again. Norbert
With i915.i915_enable_rc6=0 I am unable to trigger the bug. With the patch applied on top of 3.7-rc7, the bug is still not exposed (as expected).
Created attachment 70855 [details] i915 error state, rc6=0, patch from comment 79 Hi Chris, sorry to say it, but I got a hang today. In the background some update.mlocate etc was running, plus some git checkout of a big repository. i915 error state uploaded. Norbert
Norbert, if you can reproduce it with SNA, then due to the packing of the batchbuffer I can get a much better idea of what is going on. The suspicion is definitely some dodgy state in the surface packet - but at this moment in time, it could even be an alignment issue that's been hidden by buffer layout. :|
Hi Chris, ok, will switch to SNA, although in one of the email threads predating this bug report I switched to SNA and then was told not to. Anyway, trying to recreate it with SNA. Norbert
(In reply to comment #87) > Hi Chris, > > ok, will switch to SNA, although in one of the email threads predating this > bug report I switched to SNA and then was told not to. Dave didn't want to muddy the waters and wanted to make sure we understand the root cause of the regression with UXA; I'm trying to use SNA as a diagnostic.
First report on SNA: I got a very very strange thing yesterday. After resuming from suspend-to-ram I continued working with GIMP on some graphics, and first it started with some blue flashes on the screen, and finally the screen got completely blue, but I could in principle still interact with the windows, just without seeing anything. Restarting X (gdm) did fix it for me. Doing more heavy IO to stress test SNA. Norbert
(In reply to comment #89) > First report on SNA: I got a very very strange thing yesterday. After > resuming from suspend-to-ram I continued working with GIMP on some graphics, > and first it started with some blue flashes on the screen, and finally the > screen got completely blue, but I could in principle still interact with the > windows, just without seeing anything. Interesting, certainly the first I've heard of such. Can you see if it is possible to capture it in a screenshot, or failing that a photograph, and please open a bug report for it.
This bug is similar to, or the same as, https://bugzilla.kernel.org/show_bug.cgi?id=49571. My bisect is pointing to the changes in device_cgroup.c: [66b8ef67756b3051bf42a077a82c3c5c279caa5b] device_cgroup: add "deny_all" in dev_cgroup structure. I have two more revisions to test, but I am sure they won't matter since they are for fat and kernel-doc. Once I am done bisecting I will pull to the current mainline and run it to see if the fixes in device_cgroup fix this for me, then go from there.
Hi Chris, (In reply to comment #90) > (In reply to comment #89) > > First report on SNA: I got a very very strange thing yesterday. After > > resuming from suspend-to-ram I continued working with GIMP on some graphics, > > and first it started with some blue flashes on the screen, and finally the > > screen got completely blue, but I could in principle still interact with the > > windows, just without seeing anything. > > Interesting, certainly the first I've heard of such. Can you see if it is > possible to capture it in a screenshot, or failing that a photograph, and > please open a bug report for it. Sorry for the late reply, I was on a trip. Since it only happened once and on some rc kernels I will leave it for now, but a screenshot is not necessary, it was simply monocolor blue ... nothing else ;-) Another thing: I have now tried to hit the bug with SNA for a long time, without *any* success. As soon as I switched back to UXA the bug was triggered within a short time. Does this help you? Norbert
(In reply to comment #92) > Another thing: I have now tried to hit the bug with SNA for a long time, > without *any* success. As soon as I switched back to UXA the bug was > triggered within short time. > > Does this help you? As a null data-point, yes. My interpretation is then back towards a relocation/mm issue. (There is an outside chance that some of the surface alignment tweaks help, for which getting the SURFACE_STATE would have been useful...) Norbert, a quick scan of the bug report doesn't yield any information as to whether you tested the mb() theory. Can you please try: http://cgit.freedesktop.org/~ickle/linux-2.6 #master Just compile the master branch (at 3.7-rc4).
Created attachment 71437 [details] [review] Keep reserved objects pinned until after relocation processing. An idea at last: earlier objects are moved in order to perform the relocations. This should be impossible as we try to detect when we are going to require GTT access for relocation processing and reserve it in the right spot. However, the code change is minor and it should be easy enough to test...
Created attachment 71439 [details] [review] Keep reserved objects pinned until after relocation processing.
The "Keep reserved objects pinned" patch on 3.7.0 does not fix the hang for me. The only thing I've seen so far that stops it is the last kernel I compiled for my bisect to bad commit 6c085a728cf000ac1865d66f8c9b52935558b328. I believe a couple others have also bisected to the same result.
(In reply to comment #93) > Norbert, a quick scan of the bug report doesn't yield any information as to > whether you tested the mb() theory. Can you please try: > > http://cgit.freedesktop.org/~ickle/linux-2.6 #master Comment 74 and 75 seem to indicate that I tried at least a subset. Do you want me to try it with the full branch? Norbert
(In reply to comment #97) > (In reply to comment #93) > > Norbert, a quick scan of the bug report doesn't yield any information as to > > whether you tested the mb() theory. Can you please try: > > > > http://cgit.freedesktop.org/~ickle/linux-2.6 #master > > Comment 74 and 75 seem to indicate that I tried at least a subset. > Do you want me to try it with the full branch? Yes, they were testing for specific ideas. I'm back to trying a shotgun approach.
(In reply to comment #96) > The "Keep reserved objects pinned" patch on 3.7.0 does not fix the hang for > me. The only thing I've seen so far that stops it is the last kernel I > compiled for my bisect to bad commit > 6c085a728cf000ac1865d66f8c9b52935558b328. I believe a couple others have > also bisected to the same result. Brad, can you please attach an i915_error_state from your hang and Xorg.log?
Created attachment 71521 [details] i915_error_state and Xorg log for 3.7.0+keep reserved pinned patch Requested files
(In reply to comment #100) > Created attachment 71521 [details] > i915_error_state and Xorg log for 3.7.0+keep reserved pinned patch rc6 is disabled on Ironlake precisely because it is causing the lockup you are encountering. We already know that we are missing workarounds for enabling rc6 on Ironlake.
A patch referencing this bug report has been merged in Linux v3.7-rc8: commit 6567d748c4e94e3481e523803ec07ebd825c80d6 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Sat Nov 10 10:00:06 2012 +0000 Revert "drm/i915: enable rc6 on ilk again"
Created attachment 71651 [details] i915_error_state.txt.gz An easily reproducible hang, though it might be unrelated to the one we are after here.
(In reply to comment #103) > Created attachment 71651 [details] > i915_error_state.txt.gz > > An easily reproducible hung, though might be unrelated to the one we are > after here. That's a spectacularly broken mesa batch buffer. Its surface state base doesn't point anywhere near a buffer.
Please try out the patch at https://patchwork.kernel.org/patch/1885411/ It has a decent chance of reducing GTT thrashing, which might be good enough to duct-tape over the hangs again. Or it may change the pattern so that we can reproduce it much quicker. In any case, it should be interesting ...
I got the gpu hung error after 45 minutes on 3.7.1 with rc6=0 and the evict blocks patch applied.
Created attachment 71805 [details] [review] make the shrinker less aggressive Duct-tape solution if it is one, but imo very much worth a try.
Created attachment 71926 [details] [review] Mark unused portions of the GTT as invalid Working on the theory that this is an invalid access beyond the end of a bo, this should hopefully enable GPU detection.
Created attachment 71933 [details] [review] Align surface sizes to an even tile row And this is the complaint the GPU found. Please test this with your IO heavy workloads.
Hi Daniel, hi Chris (In reply to comment #107) > Created attachment 71805 [details] [review] > make the shrinker less aggressive > > Duct-tape solution if it is one, but imo very much worth a try. There have been a lot of patches floating around, but I have been running 3.7.0 plus this patch for a while now, using UXA (*not* SNA; with SNA it was always fine). I have not hit any problem so far, although I did heavy IO stuff as usual (svn up and git svn rebase on two 6+Gb repos), etc. Chris: What should I do next? You have posted two patches (comments 108 and 109); should I try both together, or each one independently? Or is the information that the patch from 107 is fine (so far) enough? Please let me know, and thanks for your work on that, and Merry Christmas! Norbert
(In reply to comment #110) > Hi Daniel, hi Chris > > (In reply to comment #107) > > Created attachment 71805 [details] [review] [review] [review] > > make the shrinker less aggressive > > > > Duct-tape solution if it is one, but imo very much worth a try. > > There have been a lot of patches floating around, but I was running 3.7.0 > plus this patch now for a while, and using UXA (*not* using SNA, with SNA it > always was fine). > > I did not hit any problem till now, although I did heavy IO stuff as usual > (svn up and git svn rebase on two 6+Gb repos), etc. Meh. As that patch is basically changing the ordering of the objects considered for shrinking, it just opens a can of worms - but it does seem to be a stopgap workaround. > Chris: What should I do next? You have posted two patches (108 and 109 > comments), should I try both, or each on independently? #108 is a means of provoking the GPU to spot a lot more errors, it needs a little more refinement to not first fallover on standard UXA behaviour. I'd be interested in seeing if #109 has any effect at all (on stock 3.7.0 + uxa). There is also https://patchwork.kernel.org/patch/1896161/ that would be useful to test (as it fixes a real bug and it would be cool if it was also an effective workaround here). > Or is the information that patch from 107 is fine (till now) enough? It is merely the start. :-p
Hi, I will try the patch from 109 in combination with the patchwork patch next. I think I *did* try the patchwork test recently. So I will go silent now and report back in a few days if no problems arose, or immediately if it freezes again. Thanks Norbert
[reply to comment 107] I haven't tried it yet, do you still want me to test it? [reply to comment 108] I cannot apply it on top of 3.7.1. What base do you want me to test it on? [reply to comment 109] Do I need to apply this to xf86-video-intel? If yes, which version/commit? [reply to comment 111] I tried to apply https://patchwork.kernel.org/patch/1896161/ on top of vanilla v3.7.1, but I could not get it to compile: drivers/gpu/drm/drm_mm.c: In function ‘drm_mm_scan_remove_block’: drivers/gpu/drm/drm_mm.c:612:3: error: implicit declaration of function ‘__drm_mm_hole_node_end’ [-Werror=implicit-function-declaration] Did you mean drm_mm_hole_node_end?
(In reply to comment #113) > [reply to comment 109] > Do I need to apply this to xf86-video-intel? If yes, which version/commit? Yes, it is xf86-video-intel. I am running it with the current version of Debian/sid + this patch now (2.20.14-1 + the patch) > [reply to comment 111] > I tried to apply https://patchwork.kernel.org/patch/1896161/ on top of > vanilla v3.7.1, but I could not get it to compile: Hmm, I have compiled it with 3.7.0 git kernel without a problem, some offset while patching but that was all. Norbert
(In reply to comment #114) > Yes, it is xf86-video-intel. I am running it with the current version of > Debian/sid + this patch now (2.20.14-1 + the patch) Is it a standalone patch or does it depend on the former DRM patch? Anyway, I have tested it with 3.7.1 + i915.i915_enable_rc6=1 and it still triggers the hang check thingey as before. (without enabling rc6 it does not do that) > > [reply to comment 111] > > I tried to apply https://patchwork.kernel.org/patch/1896161/ on top of > > vanilla v3.7.1, but I could not get it to compile: > > Hmm, I have compiled it with 3.7.0 git kernel without a problem, some offset > while patching but that was all. I got some offset issues too, but it failed to compile at all because the function was not defined. Did you really apply the patch to the right source tree?
(In reply to comment #115) > (In reply to comment #114) > > Yes, it is xf86-video-intel. I am running it with the current version of > > Debian/sid + this patch now (2.20.14-1 + the patch) > Is it a standalone patch or does it depend on the former DRM patch? Anyway, > I have tested it with 3.7.1 + i915.i915_enable_rc6=1 and it still triggers > the hang check thingey as before. (without enabling rc6 it does not do that) the rc6 needs to be disabled *in*any*case*, that is known by now. And it is a standalone patch of xf86-video-intel. Did you recompile it? > > > [reply to comment 111] > > > I tried to apply https://patchwork.kernel.org/patch/1896161/ on top of > > > vanilla v3.7.1, but I could not get it to compile: > > > > Hmm, I have compiled it with 3.7.0 git kernel without a problem, some offset > > while patching but that was all. > I got some offset issues too, but it failed to compile at all because the > function was not defined. Did you really apply the patch to the right source > tree? Huuu? Are you sure? I checked my git commit log and it is kernel 3.7 tag of Linus, then the merge into my git repo, and then the patch. Nothing else. I guess you are running something else. Norbert
(In reply to comment #116) > the rc6 needs to be disabled *in*any*case*, that is known by now. > And it is a standalone patch of xf86-video-intel. Did you recompile it? With rc6 disabled I cannot trigger the bug. Yes, I recompiled and restarted X. > > > > [reply to comment 111] > > > > I tried to apply https://patchwork.kernel.org/patch/1896161/ on top of > > > > vanilla v3.7.1, but I could not get it to compile: > > > > > > Hmm, I have compiled it with 3.7.0 git kernel without a problem, some offset > > > while patching but that was all. > > I got some offset issues too, but it failed to compile at all because the > > function was not defined. Did you really apply the patch to the right source > > tree? > > Huuu? Are you sure? I checked my git commit log and it is kernel 3.7 tag of > Linus, then the merge into my git repo, and then the patch. Nothing else. > > I guess you are running something else. $ grep -rn __drm_mm_hole_node_end (empty) $ git log -S __drm_mm_hole_node_end (empty) $ git describe v3.7.1 (using remote git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git, but git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git is in the same directory) CONFIG_DRM=m (obvious..) and I cannot see any #ifdefs that exclude that file/function. Can you check your compile logs/flags to be sure?
[/usr/src/git-kernel/linux-2.6] grep -rn drm_mm_hole_node_end
Binary file drivers/gpu/drm/drm.ko matches
...
drivers/gpu/drm/drm_mm.c:126: unsigned long hole_end = drm_mm_hole_node_end(hole_node);
drivers/gpu/drm/drm_mm.c:210: unsigned long hole_end = drm_mm_hole_node_end(hole_node);
...
(In reply to comment #118) > [/usr/src/git-kernel/linux-2.6] grep -rn drm_mm_hole_node_end > Binary file drivers/gpu/drm/drm.ko matches > ... I saw that, but I have a symbol with a __ prefix. Looking at the comments, there are two versions of that patch. comment 105 and comment 111 (v3). I guess you applied the earlier one? Chris, can you comment on this?
(In reply to comment #119) > I saw that, but I have a symbol with a __ prefix. Aren't these created by the compiler ??? > Looking at the comments, there are two versions of that patch. comment 105 > and comment 111 (v3). I guess you applied the earlier one? Chris, can you Indeed ... indeed ... I don't know why, but checking not the git log I wrote, but the actual diff I see that it is the patchwork patch from 105 ... ok, trying the other one now... Norbert
(In reply to comment #120) > > Looking at the comments, there are two versions of that patch. comment 105 > > and comment 111 (v3). I guess you applied the earlier one? Chris, can you > > Indeed ... indeed ... I don't know why, but checking not the git log I > wrote, but the actual diff I see that it is the patchwork patch from 105 ... > ok, trying the other one now... Indeed, patchwork 1896161 from comment 111 does not compile. Norbert
Created attachment 72022 [details] [review] Only evict the blocks required to free the hole (In reply to comment #121) > Indeed, patchwork 1896161 from comment 111 does not compile. It's just based on a slightly more recent tree. Meta-patch: s/__drm/drm/
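The rename can be applied to a downloaded copy of the patch before feeding it to `patch` or `git apply`. A sketch, assuming the patch was saved locally; the function name and file names are hypothetical, and the substitution is taken literally from the comment above, so it touches every `__drm` in the file and the result is worth reviewing by eye:

```shell
#!/bin/sh
# rename_drm_prefix IN OUT
# Apply the s/__drm/drm/ substitution from the comment above to a saved
# patch file and write the result to OUT. IN and OUT are placeholders.
rename_drm_prefix() {
    sed 's/__drm/drm/g' "$1" > "$2"
}

# Example usage (file names are placeholders):
#   rename_drm_prefix evict-blocks-v3.patch evict-blocks-v3-for-3.7.patch
#   grep -n '__drm' evict-blocks-v3-for-3.7.patch    # check for leftovers
```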
(In reply to comment #122) > It's just based on a slightly more recent tree. Meta-patch: s/__drm/drm/ Thanks, rebooting now with new kernel (and still patched intel xf86 driver)
I am unable to reproduce any hang with rc6 disabled. Chris, should I test it with rc6 enabled? Norbert, do you have a reliable test-case?
My sw/hw details:
- Distro: Arch Linux x86_64 (with testing repos enabled)
- DDX: xf86-video-intel 2.20.16 on Xorg 1.13.1
- KDE 4.9.4 as desktop environment, KWin uses OpenGL compositing
- Kernel: 3.7.1 (config https://raw.github.com/Lekensteyn/aur/master/linux-custom/config + watchdog patch)
- SSD: Intel 320 80G (/boot + LUKS-encrypted filesystem)
- CPU: i5-460M
- RAM: 8G
My rc6-enabled hang could be triggered by copying a Linux source tree from the SSD to tmpfs a few times (without hitting OOM).
(In reply to comment #124) > I am unable to reproduce any hang with rc6 disabled. Chris, should I test it > with rc6 enabled? One bug at a time! ;-) The rc6=0 bug seems much more nasty as it is affecting gen4/gen5 and seems to imply a deep underlying synchronisation issue (so could actually be all architectures, just more prevalent on gen4/5). The rc6=1 ilk bug really does look like a missing hw workaround, dying between a series of flushes on the render ring is a good indication that our hw interaction is at fault. rc6 is disabled by default on ilk because we had not yet managed to make it work reliably, that it still doesn't work makes it a lower priority bug to chase. However, given the closely linked bisection, fixing the common problem may indeed make the rc6 bug harder to trigger.
(In reply to comment #122) > Created attachment 72022 [details] [review] > Only evict the blocks required to free the hole > > (In reply to comment #121) > > Indeed, patchwork 1896161 from comment 111 does not compile. > > It's just based on a slightly more recent tree. Meta-patch: s/__drm/drm/ OK, it seems to be very stable now. I have been running for a few days now with the patched xf86-intel driver and this patch (#122), and didn't have any hiccups at all. Thanks a lot Norbert
xf86-video-intel
commit 736b89504a32239a0c7dfb5961c1b8292dd744bd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun Dec 30 10:32:18 2012 +0000

uxa: Align surface allocations to even tile rows

Align surface sizes to an even number of tile rows to cater for sampler prefetch. If we read beyond the last page we may catch the PTE in a state of flux and trigger a GPU hang. Also detected by enabling invalid PTE access checking.

References: https://bugs.freedesktop.org/show_bug.cgi?id=56916
References: https://bugs.freedesktop.org/show_bug.cgi?id=55984
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
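The rounding described in the commit message can be illustrated with plain integer arithmetic. This is only a sketch of the idea, not the code from the xf86-video-intel commit: the function name is invented, and the tile heights used in the calls (8 rows and 32 rows) are assumptions about the hardware tiling modes that are not stated in this report.

```shell
#!/bin/sh
# align_even_tile_rows HEIGHT TILE_HEIGHT
# Round HEIGHT (in scanlines) up to a multiple of two tile rows, so the
# surface always ends on an even tile-row boundary.
# Hypothetical helper for illustration only.
align_even_tile_rows() {
    step=$(( 2 * $2 ))
    echo $(( ($1 + $step - 1) / $step * $step ))
}

align_even_tile_rows 100 8    # prints 112
align_even_tile_rows 16 8     # prints 16
align_even_tile_rows 17 32    # prints 64
```

Padding to an even number of rows means a prefetch that runs one tile row past the last used row still lands inside the allocation rather than on a page whose PTE may be changing.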
(In reply to comment #126) > Ok. seems to work very stable now. I am running now a few days with the > patches xf86-intel driver and this patch (#122), and didn't have any hiccups > at all. I can confirm this. Very stable now. Thinkpad X220 (SNB), Linux 3.7.1 (w/ patch), intel 2.20.17 (SNA)
(In reply to comment #126) > Ok. seems to work very stable now. I am running now a few days with the > patches xf86-intel driver and this patch (#122), and didn't have any hiccups > at all. Ok, I got one. It took a *long* time, and I (stupidly) didn't grab the error state (too tired after 24 hours of flight/travel), but it definitely happened while running an apt-get upgrade after a long time, while at the same time browsing photos from my travel. Let's say the situation has vastly improved, but is not perfect. I will try to get the error state as soon as I hit it again. Norbert
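Since the error state is easy to forget after a hang, a small helper can copy it somewhere persistent. A sketch, assuming debugfs is mounted at /sys/kernel/debug (the dmesg output earlier in this report refers to the file as /debug/dri/0/i915_error_state) and that an empty state reads as "no error state collected", which is an assumption about the driver's output; the function name and destination directory are made up:

```shell
#!/bin/sh
# save_error_state SRC DEST_DIR
# Copy SRC to DEST_DIR/i915_error_state-<timestamp>.txt and print the new
# path. Fails if SRC is unreadable or reports that nothing was captured.
# Hypothetical helper for illustration only.
save_error_state() {
    src=$1
    dest_dir=$2
    [ -r "$src" ] || return 1
    if grep -q 'no error state collected' "$src"; then
        return 1
    fi
    mkdir -p "$dest_dir" || return 1
    out="$dest_dir/i915_error_state-$(date +%Y%m%d-%H%M%S).txt"
    cp "$src" "$out" && echo "$out"
}

# Example usage (needs root; the destination is a placeholder):
#   save_error_state /sys/kernel/debug/dri/0/i915_error_state "$HOME/gpu-hangs"
```

Run from a cron job or right after noticing the hangcheck message in dmesg, this keeps a copy even if the machine is rebooted before anyone looks at it.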
Created attachment 72697 [details] [review] Longshot 1: remove g4x/g5 specific MI_FLUSH This bit is described as disabling the reload of indirect state pointers from the context. Since we aren't using any of that, toggling this bit shouldn't do anything. However, it is g4x/g5 specific...
Created attachment 72698 [details] [review] Longshot 2: make the shrinker less aggressive towards instruction bo A variation on the shrinker, with the theory being that it is the kernel / instruction state that is being corrupted by the rebinding.
Hi everyone, (In reply to comment #129) > (In reply to comment #126) > > Ok. seems to work very stable now. I am running now a few days with the > > patches xf86-intel driver and this patch (#122), and didn't have any hiccups > > at all. > > Ok, I got one. It took a *long* time, and I (stupidly) didn't grab the error > state (too tired after 24 hours of flight/travel), but it definitely > happened while running a apt-get upgrade after a long time, while at the > same time browsing photos from my travel. > > Let's say, the situation as wastely improved, but is not perfect. I will try > to get the error state as soon as I hit it again. Actually trying it over, I see the following: 3.7.0 plus patch from #122 with patched intel driver is rock solid 3.8.0-rc2 with patched intel driver (but no kernel patch) hangs (uploading the error state file soon) I will try 3.8.0-rc3 with #122 patch plus the two from Chris 130/131 now. Norbert
Created attachment 72770 [details] i915 error state, 3.8.0-rc2, no patches
Everyone please retest with latest drm-intel-fixes from http://cgit.freedesktop.org/~danvet/drm-intel I've just merged a bunch of duct-tapes for this issue. For those who can only reproduce the hangs with rc6 enabled, please also try reenabling that with i915.i915_enable_rc6=1.
Created attachment 72847 [details] [review] Hang me So now the workaround is upstream, we need to find a way to retrigger the bug... This patch causes us to unbind everything after each batch - but it also causes execution to be serialised. So the timing is going to be completely different versus the IO related hangs... We might try evicting before the batch instead.
*** Bug 59280 has been marked as a duplicate of this bug. ***
> 3.7.0 plus patch from #122 with patched intel driver is rock solid > 3.8.0-rc2 with patched intel driver (but no kernel patch) hangs (uploading > the error state file soon) > > I will try 3.8.0-rc3 with #122 patch plus the two from Chris 130/131 now. 3.8.0-rc3 with patches from #122, #130, #131 seems to be very stable again. Concerning the other two requests: Since testing on *absence* of the bug always takes me a few days until I am convinced that it does not appear, which of the two #134 or #135 should I try next, taking into account that the patches from 122,130,131 do work out in some way. Thanks Norbert
*** Bug 56916 has been marked as a duplicate of this bug. ***
*** Bug 57122 has been marked as a duplicate of this bug. ***
*** Bug 57136 has been marked as a duplicate of this bug. ***
Created attachment 73082 [details] [review] Drop caches
I can reproduce this using the attached patch and UXA on ilk:
$ while sleep .5; do echo 15 > /sys/kernel/debug/dri/0/i915_gem_drop_caches ; done &
$ DISPLAY=:0 CAIRO_TEST_TARGET=xlib ./cairo-perf-trace -i6 cairo-traces/benchmark/firefox-fishtank.trace
Dies in mere seconds.
Created attachment 73105 [details] [review] Invalidate the presumed_offsets along the slow relocation path
commit 262b6d363fcff16359c93bd58c297f961f6e6273 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Tue Jan 15 16:17:54 2013 +0000 drm/i915: Invalidate the relocation presumed_offsets along the slow path
A patch referencing this bug report has been merged in Linux v3.8-rc4: commit 93927ca52a55c23e0a6a305e7e9082e8411ac9fa Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Thu Jan 10 18:03:00 2013 +0100 drm/i915: Revert shrinker changes from "Track unbound pages"
A patch referencing this bug report has been merged in Linux v3.8-rc4: commit 901593f2bf221659a605bdc1dcb11376ea934163 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed Dec 19 16:51:06 2012 +0000 drm: Only evict the blocks required to create the requested hole
I can confirm I have not experienced the bug with drm-intel-next after several days of testing. What's the procedure next? Are the patches going to be backported to 3.7.x?
(In reply to comment #146) > I can confirm I have not experienced the bug with drm-intel-next after > several days of testing. What's the procedure next? Are the patches going to > be backported to 3.7.x? The band-aid is backported already afaik, the real fix should show up in the next 3.7.x point release (currently under -stable review).
Ah, thanks, great! Now I see that the two patches from Comment #144 and Comment #145 are included in 3.7.3. So I guess the real fix you're talking about is the one from Comment #143.
(In reply to comment #148) > Ah, thanks, great! Now I see that the two patches from Comment #144 and > Comment #145 are included in 3.7.3. So I guess the real fix you're talking > about it the one from Comment #143. Yep, that's right.
Closing.