Bug 55984 - [ilk regression] gpu hangs on ironlake with 3.6 + -next + -fixes code
Status: RESOLVED FIXED
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other All
Importance: medium normal
Assigned To: Intel GFX Bugs mailing list
Duplicates: 56916 57122 57136 59280
Reported: 2012-10-14 23:47 UTC by Dave Airlie
Modified: 2013-01-25 16:42 UTC (History)
Attachments
error state (1.40 MB, text/plain), 2012-10-14 23:47 UTC, Dave Airlie
another error state (190.21 KB, application/octet-stream), 2012-11-02 06:40 UTC, Dave Airlie
i915_error_state.txt.gz (204.22 KB, application/x-gzip), 2012-11-03 15:55 UTC, Peter Wu
Xorg.0.log (35.37 KB, text/plain), 2012-11-03 16:00 UTC, Peter Wu
disable unmappable (511 bytes, patch), 2012-11-06 10:08 UTC, Daniel Vetter
i915_error_state.gz with sledgehammer patch (182.17 KB, application/x-gzip), 2012-11-06 11:09 UTC, Peter Wu
i915_error_state.txt.gz on ilk-wa-pile 6ef21d3 + sledgehammer (176.37 KB, application/x-gzip), 2012-11-06 17:19 UTC, Peter Wu
i915_error_state.txt.gz ilk-wa-pipe 6ef21d3 + sledgehammer + ring flush (209.04 KB, application/x-gzip), 2012-11-06 20:31 UTC, Peter Wu
ilk-wa-pipe + sledgehammer + ring flush (193.44 KB, application/x-gzip), 2012-11-07 08:40 UTC, Norbert Preining
i915 error state with ilk-wa-pipe + sledgehammer + ring flush (another hang) (198.02 KB, application/x-gzip), 2012-11-08 23:31 UTC, Norbert Preining
i915_error_state.txt.gz ickle/linux-2.6 fastboot with sledgehammer + ring flush (180.02 KB, application/x-gzip), 2012-11-09 18:48 UTC, Peter Wu
i915 error state, i915_enable_rc6=0, rc4 + ilk-wa-pipe + sledgehammer + ring flush (192.84 KB, application/x-gzip), 2012-11-11 13:41 UTC, Norbert Preining
disable unbound tracking (1.25 KB, patch), 2012-11-15 13:15 UTC, Daniel Vetter
always do set-gtt-domain (2.33 KB, patch), 2012-11-15 13:45 UTC, Chris Wilson
always do set-gtt-domain (2.78 KB, patch), 2012-11-15 21:13 UTC, Chris Wilson
disable cpu relocs completely (543 bytes, patch), 2012-11-16 18:24 UTC, Daniel Vetter
dmesg/error_state on ilk/drm-intel-nightly/ubuntu 12.10 (327.54 KB, application/x-gzip), 2012-11-17 14:30 UTC, Imre Deak
dmesg from 3.6.0-rc2-git-87-g0327d6b no patches (58.28 KB, text/plain), 2012-11-17 17:14 UTC, Peter Wu
dmesg from 3.7.0-rc5+ (61.57 KB, text/plain), 2012-11-18 12:20 UTC, Norbert Preining
use dma32 for gem bo allocations (535 bytes, patch), 2012-11-19 09:52 UTC, Daniel Vetter
kernel config used for 3.7.x (93.91 KB, text/plain), 2012-11-19 18:33 UTC, Peter Wu
config-3.7.0-rc4-g6283022 (89.30 KB, text/plain), 2012-11-19 21:07 UTC, Imre Deak
Norbert's kernel config (81.86 KB, text/plain), 2012-11-19 22:21 UTC, Norbert Preining
bug hit with patches 3.7-rc6 plus patches from #43 and #46, but not #59 (190.17 KB, application/x-gzip), 2012-11-20 07:39 UTC, Norbert Preining
Script to trigger the bug. (488 bytes, text/plain), 2012-11-22 13:04 UTC, Mika Kuoppala
Don't force GTT/CPU relocations (7.18 KB, patch), 2012-11-26 09:53 UTC, Chris Wilson
i915 error state, rc6=0, patch from comment 79 (197.10 KB, application/x-gzip), 2012-12-01 05:52 UTC, Norbert Preining
Keep reserved objects pinned until after reloction processing. (2.47 KB, patch), 2012-12-13 10:37 UTC, Chris Wilson
Keep reserved objects pinned until after reloction processing. (2.85 KB, patch), 2012-12-13 10:54 UTC, Chris Wilson
i915_error_state and Xorg log for 3.7.0+keep reserved pinned patch (139.32 KB, application/octet-stream), 2012-12-14 20:31 UTC, Brad Jackson
i915_error_state.txt.gz (215.71 KB, application/x-gzip), 2012-12-17 13:28 UTC, Imre Deak
make the shrinker less aggressive (2.18 KB, patch), 2012-12-19 13:40 UTC, Daniel Vetter
Mark unused portions of the GTT as invalid (4.33 KB, patch), 2012-12-21 12:52 UTC, Chris Wilson
Align surface sizes to an even tile row (839 bytes, patch), 2012-12-21 13:52 UTC, Chris Wilson
Only evict the blocks required to free the hole (4.73 KB, patch), 2012-12-23 11:40 UTC, Chris Wilson
Longshot 1: remove g4x/g5 specific MI_FLUSH (848 bytes, patch), 2013-01-09 02:44 UTC, Chris Wilson
Longshot 2: make the shrinker less aggressive towards instruction bo (2.18 KB, patch), 2013-01-09 02:46 UTC, Chris Wilson
i915 error state, 3.8.0-rc2, no patches (198.48 KB, application/x-gzip), 2013-01-10 04:27 UTC, Norbert Preining
Hang me (737 bytes, patch), 2013-01-11 10:19 UTC, Chris Wilson
Drop caches (5.05 KB, patch), 2013-01-15 13:14 UTC, Chris Wilson
Invalidate the presumed_offsets along the slow relocation path (2.63 KB, patch), 2013-01-15 16:20 UTC, Chris Wilson

Description Dave Airlie 2012-10-14 23:47:26 UTC
Created attachment 68564 [details]
error state

Okay I've seen both my ilk machines gpu hang over the weekend, and I've never seen them do it before.

I've got an error state from one at least; if it's not in there then I suspect rc6.
Comment 1 Chris Wilson 2012-10-15 08:30:42 UTC
The hang looks pretty clean (no suspicious operations), more or less upon the transition from a 3D to BLT within a UXA batch buffer; rc6 requiring w/a would not surprise me.
Comment 2 Daniel Vetter 2012-10-15 13:35:46 UTC
Ok, I've hunted around in our docs a bit and found a few ilk w/as we don't implement. Or at least what I think we miss, given our sorry state of docs. Pushed out to

http://cgit.freedesktop.org/~danvet/drm/log/?h=ilk-wa-pile
Comment 3 Chris Wilson 2012-10-17 17:07:12 UTC
Also if you want to pin the blame on rc6, i915.i915_enable_rc6=0...
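
For anyone repeating this test, it helps to confirm that the option actually reached the kernel before waiting days for a hang. The following is a minimal sketch, not part of the original report: the function name `rc6_setting` is hypothetical, and it only inspects a kernel command line string (such as the contents of /proc/cmdline), so options set through modprobe configuration are not covered.

```shell
# Print the value of i915.i915_enable_rc6 found in a kernel command line,
# or "default" when the parameter is absent. If the parameter appears more
# than once, the last occurrence is printed.
# rc6_setting is a hypothetical helper name used for illustration only.
rc6_setting() {
    # Split the command line into one word per line, then look for the option.
    printf '%s\n' "$1" | tr -s ' \t' '\n' |
        awk -F= '$1 == "i915.i915_enable_rc6" { v = $2; found = 1 }
                 END { print (found ? v : "default") }'
}
```

A typical call would be `rc6_setting "$(cat /proc/cmdline)"`; a result of `0` means rc6 was disabled on the boot command line.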
Comment 4 Dave Airlie 2012-10-29 21:34:10 UTC
Okay, got another death with rc6 disabled, like Norbert.

took about 3-4 days this time.
Comment 5 Dave Airlie 2012-11-02 06:40:32 UTC
Created attachment 69415 [details]
another error state

[133200.848120] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[133200.848128] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[133202.367409] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[133202.367692] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[133202.367695] [drm:i915_reset] *ERROR* Failed to reset chip.

bits from dmesg.

this is 3.6.0 + -next + ilks wa, I'll try and start a bisect on it now, 4-5 days a hang, back in a few years
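
Since the same handful of messages recurs in every report here, a small filter can make long kernel logs easier to compare. This is a rough sketch under the assumption that the message texts match the ones quoted above; the function name `summarize_gpu_hang` is made up for illustration.

```shell
# Read kernel log text on stdin and summarise the GPU hang messages quoted
# in this report: how many hangcheck events fired, and whether the driver
# gave up and declared the GPU wedged.
# summarize_gpu_hang is a hypothetical helper name.
summarize_gpu_hang() {
    awk '
        /Hangcheck timer elapsed/ { hangs++ }
        /declaring wedged/        { wedged = 1 }
        END {
            printf "hangs=%d wedged=%s\n", hangs, (wedged ? "yes" : "no")
        }
    '
}
```

It can be fed from a live log, for example `dmesg | summarize_gpu_hang`, or from a saved dmesg attachment.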
Comment 6 Chris Wilson 2012-11-02 08:57:32 UTC
That error-state is more consistent with a relocation failure than Norbert's - it fails trying to execute a composite operation within the middle of a batch.
Comment 7 Peter Wu 2012-11-03 15:55:39 UTC
Created attachment 69489 [details]
i915_error_state.txt.gz

Having seen https://lkml.org/lkml/2012/10/23/155 I think I am affected by the same bug. While I was compiling a kernel in a tmpfs, all of a sudden KWin died. When I looked in dmesg, I saw:
[95597.708097] pci 0000:01:00.0: power state changed by ACPI to D3cold
[98683.176729] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[98683.176736] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[98683.184252] [drm:init_ring_common] *ERROR* failed to set render ring head to zero ctl 00000000 head 69c191cc tail 00000000 start 00003000
[98683.240710] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 69c191cc tail 00000000 start 00003000
[98686.163041] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[98686.163202] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[98686.163205] [drm:i915_reset] *ERROR* Failed to reset chip.

Attached is an i915_error_state from today, running 3.7-rc2-492-ge657e07. (only some ARM patches before 3.7-rc3).
I remember that I had exactly the same error message in a -testing branch on 3.6 (http://cgit.freedesktop.org/~danvet/drm-intel/tag/?h=drm-intel-testing&id=drm-intel-next-2012-09-20). I built that kernel on Sep 21 and it locked up on Sep 27 (no rebooting, just suspends). If you want a dmesg (nothing interesting) or logs/i915_error_state from that 3.6 kernel, let me know.
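
The init_ring_common lines in the dmesg excerpt above carry the render ring register values (ctl, head, tail, start) as hexadecimal words. To compare them across reports, something like the following sketch may help; it assumes the message layout shown above, and `ring_registers` is a hypothetical name.

```shell
# Read kernel log text on stdin and print the ring register values from
# each init_ring_common error line as name=0xvalue pairs.
# ring_registers is a hypothetical helper name.
ring_registers() {
    awk '
        /init_ring_common/ {
            # The word "head" also occurs in the message text itself, so
            # start parsing only at the "ctl" keyword.
            first = 0
            for (i = 1; i <= NF; i++)
                if ($i == "ctl") { first = i; break }
            if (!first) next
            out = ""; sep = ""
            for (i = first; i < NF; i += 2) {
                out = out sep $i "=0x" $(i + 1)
                sep = " "
            }
            print out
        }
    '
}
```

Usage would be along the lines of `dmesg | ring_registers`, printing one line per matching message.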

# lspci -vv -s 00:02.0
00:02.0 VGA compatible controller: Intel Corporation Core Processor Integrated Graphics Controller (rev 02) (prog-if 00 [VGA controller])                                                            
        Subsystem: CLEVO/KAPOK Computer Device 7130                                                                                                                                                  
        Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+                                                                                        
        Status: Cap+ 66MHz- UDF- FastB2B+ ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-                                                                                         
        Latency: 0                                                                                                                                                                                   
        Interrupt: pin A routed to IRQ 47
        Region 0: Memory at fd000000 (64-bit, non-prefetchable) [size=4M]
        Region 2: Memory at d0000000 (64-bit, prefetchable) [size=256M]
        Region 4: I/O ports at 1800 [size=8]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
                Address: fee0f00c  Data: 4142
        Capabilities: [d0] Power Management version 2
                Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
                Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [a4] PCI Advanced Features
                AFCap: TP+ FLR+
                AFCtrl: FLR-
                AFStatus: TP-
        Kernel driver in use: i915
Comment 8 Peter Wu 2012-11-03 16:00:01 UTC
Created attachment 69490 [details]
Xorg.0.log

In Xorg, I only changed to use SNA instead of the default (UXA?).
Comment 9 Peter Wu 2012-11-05 20:41:04 UTC
In case it gets lost, I bisected the hang to:

504c7267a1e84b157cbd7e9c1b805e1bc0c2c846 is the first bad commit
commit 504c7267a1e84b157cbd7e9c1b805e1bc0c2c846
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 23 13:12:52 2012 +0100

    drm/i915: Use cpu relocations if the object is in the GTT but not mappable
    
    This prevents the case of unbinding the object in order to process the
    relocations through the GTT and then rebinding it only to then proceed
    to use cpu relocations as the object is now in the CPU write domain. By
    choosing to use cpu relocations up front, we can therefore avoid the
    rebind penalty.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

:040000 040000 090ed3d52b4f3210b988877f747b6ff86e123385 1d48be89ded4777a543b693db833de64877059c4 M      drivers
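
When bisect output like the above has been saved to a file, the offending commit id can be pulled out for follow-up commands. A minimal sketch, assuming the "is the first bad commit" line format shown above; `first_bad_commit` is a hypothetical name.

```shell
# Read saved "git bisect" output on stdin and print the 40-character id
# from the "<sha> is the first bad commit" line, if there is one.
# first_bad_commit is a hypothetical helper name.
first_bad_commit() {
    sed -n 's/^\([0-9a-f]\{40\}\) is the first bad commit$/\1/p'
}
```

For example, `git show --stat "$(first_bad_commit < bisect.txt)"` would display the commit, where bisect.txt is an assumed file name. With a hang that takes days to appear, a "good" verdict during bisection may simply mean the test did not run long enough, so the result is worth confirming with a revert.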
Comment 10 Daniel Vetter 2012-11-06 09:56:41 UTC
Ok, doesn't look like an rc6 thing, but very much like a regression.
Comment 11 Peter Wu 2012-11-06 10:04:56 UTC
Reverting that commit on top of 3.7-rc4 did not fix the hang issue. If you need any guinea pig for testing, here I am.
Comment 12 Daniel Vetter 2012-11-06 10:08:24 UTC
Created attachment 69604 [details] [review]
disable unmappable

Since right now we still have tons of signs pointing at unmappable gtt handling to be broken/non-coherent somehow, let's try this sledgehammer here and simply disable it all.
Comment 13 Peter Wu 2012-11-06 10:40:05 UTC
I applied that "sledgehammer" patch on 3.7-rc4, but the error persists. I saved dmesg and the i915_error_state file. If you need more information (or those logs), please let me know.
Comment 14 Daniel Vetter 2012-11-06 11:04:13 UTC
Can you please attach the new error_state with the sledgehammer? Maybe things shifted around enough to see what's going on ...
Comment 15 Peter Wu 2012-11-06 11:09:54 UTC
Created attachment 69615 [details]
i915_error_state.gz with sledgehammer patch
Comment 16 Chris Wilson 2012-11-06 16:52:28 UTC
(In reply to comment #15)
> Created attachment 69615 [details]
> i915_error_state.gz with sledgehammer patch

Note that this hang is slightly different again, closer to the one reported by Norbert, in that the HEAD didn't advance into the batchbuffer, as opposed to a hang within or after the batch.

So can you please try the hack in conjunction with the ilk-wa-pile?
Comment 17 Peter Wu 2012-11-06 17:19:44 UTC
Created attachment 69630 [details]
i915_error_state.txt.gz on ilk-wa-pile 6ef21d3 + sledgehammer

The issue still exists. Same errors in dmesg.
Comment 18 Chris Wilson 2012-11-06 17:44:17 UTC
Ok, next interesting observation is that your error states both have a double emission of the request seqno, so perhaps submitting that many PIPE_CONTROL in sequence is triggering an error? Can you please test, on top of everything else,

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 3af1f2f..994d752 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -662,10 +662,11 @@ pc_render_add_request(struct intel_ring_buffer *ring,
 	 * incoherence by flushing the 6 PIPE_NOTIFY buffers out to
 	 * memory before requesting an interrupt.
 	 */
-	ret = intel_ring_begin(ring, 32);
+	ret = intel_ring_begin(ring, 34);
 	if (ret)
 		return ret;
 
+	intel_ring_emit(ring, MI_FLUSH);
 	intel_ring_emit(ring, GFX_OP_PIPE_CONTROL(4) | PIPE_CONTROL_QW_WRITE |
 			PIPE_CONTROL_WRITE_FLUSH |
 			PIPE_CONTROL_TEXTURE_CACHE_INVALIDATE);
@@ -691,6 +692,7 @@ pc_render_add_request(struct intel_ring_buffer *ring,
 	intel_ring_emit(ring, pc->gtt_offset | PIPE_CONTROL_GLOBAL_GTT);
 	intel_ring_emit(ring, seqno);
 	intel_ring_emit(ring, 0);
+	intel_ring_emit(ring, MI_FLUSH);
 	intel_ring_advance(ring);
 
 	*result = seqno;
Comment 19 Peter Wu 2012-11-06 20:31:01 UTC
Created attachment 69639 [details]
i915_error_state.txt.gz ilk-wa-pipe 6ef21d3 + sledgehammer + ring flush

The bug is still triggered.
Comment 20 Norbert Preining 2012-11-07 08:40:14 UTC
Created attachment 69656 [details]
ilk-wa-pipe + sledgehammer + ring flush

Same here, I got a hang with all the mentioned patches while compiling a big bunch of TeX Live. Error state is here now.
Comment 21 Chris Wilson 2012-11-07 08:52:44 UTC
ARGH!

Still it hangs in the middle of a series of requests (with no intervening batches or other operations). That should be impossible design wise, and improbable hardware wise.
Comment 22 Norbert Preining 2012-11-08 23:31:33 UTC
Created attachment 69781 [details]
i915 error state with ilk-wa-pipe + sledgehammer + ring flush (another hang)

Here is another hang with a different error state (at least to my eyes). Happened when running git checkout on a big repository. No other messages.
Comment 23 Peter Wu 2012-11-09 15:54:41 UTC
Is there anything to test? I mentioned before that this occurs when the memory is almost full. I have no swap, but 8GB RAM. Copied five times 1.2GiB (=6GiB total) to tmpfs (/dev/shm and /tmp).
Comment 24 Chris Wilson 2012-11-09 16:25:19 UTC
8GB machine (i3-330m) with no swap:

$ mount -ttmpfs -osize=100% none /tmp/wtf
$ while :; do yes wtf > /tmp/wtf/wtf; done &
$ sudo X -ac -noreset & while :; do x11perf -aa10text -d :0; done

with that I am able to repeatedly drive the machine to oom without triggering a GPU hang. Note I am using this set of patches on top of dinq: http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=fastboot

Peter, is that close enough to your test case to trigger the bug, or do I need to tweak it slightly? Can you please also test with the patches in fastboot, in case there is an accidental fix?
Comment 25 Peter Wu 2012-11-09 18:48:51 UTC
Created attachment 69833 [details]
i915_error_state.txt.gz ickle/linux-2.6 fastboot with sledgehammer + ring flush

The bug still triggers, w/ and w/o the sledgehammer+ring flush patches.

The dmesg is now slightly different on the ickle/linux-2.6 fastboot branch:

[  501.214949] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  501.214958] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[  501.219393] [drm:init_ring_common] *ERROR* failed to set render ring head to zero ctl 00000000 head 09e16d8c tail 00000000 start 00300000
[  501.262795] [drm:intel_dp_aux_wait_done] *ERROR* dp aux hw did not signal timeout (has irq: 1)!
[  501.274784] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 09e16d8c tail 00000000 start 00300000
[  501.302762] [drm:intel_dp_aux_wait_done] *ERROR* dp aux hw did not signal timeout (has irq: 1)!
[  502.274145] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  502.274293] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[  502.274298] [drm:i915_reset] *ERROR* Failed to reset chip.


One thing that I now notice (because I did not try it before) is that switching to a text console (ctrl+alt+F1) gives me a black screen with some flashing large rectangles on screen (possibly the flashing cursor for the username)

Starting just X from a TTY and then running x11perf/glxgears/glxspheres together with cp linux tree / yes / dd if=/dev/zero of=/tmp/wtf.. did not trigger the bug, even when the OOM killer took out half my machine.

I can only reproduce it after logging into KDE and running a GL program (like glxgears). x11perf does not trigger the bug, even in KDE. Maybe other (compositing) window managers work too, but I have not tested that.

I am using the below bash script after logging into KDE. After starting this script, I watch the kernel log (journalctl -f) and run `glxgears`.
#!/bin/bash
mkdir -p /tmp/wtf
mountpoint /tmp/wtf||sudo mount -osize=6200M -t tmpfs none /tmp/wtf
echo 15 > /proc/$$/oom_score_adj # just in case...
pids=
for i in /tmp/wtf/hang-{1..6}; do
        rm -rf "$i"
        #yes wtf > $i & # did not work
        cp -ra ~/Linux-src/linux "$i" &
        pids="$pids $!"
done
trap "kill $pids" EXIT
wait
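
The -osize=6200M in the script above fits an 8GB machine. To try the same kind of memory pressure on a machine with a different amount of RAM, the size could be derived from MemTotal instead. This is only a sketch: the 75% default is my approximation of the 6200M-on-8GB ratio, not a value from the report, and `tmpfs_size_mb` is a hypothetical name.

```shell
# Read /proc/meminfo-style text on stdin and print a tmpfs size (in MiB,
# with an "M" suffix) equal to the given percentage of MemTotal.
# The percentage defaults to 75, an assumed approximation of 6200M on 8GB.
# tmpfs_size_mb is a hypothetical helper name.
tmpfs_size_mb() {
    awk -v pct="${1:-75}" '
        # MemTotal is reported in kB; convert the chosen share to MiB.
        $1 == "MemTotal:" { printf "%dM\n", $2 * pct / 100 / 1024; exit }
    '
}
```

It could then replace the fixed size, as in `sudo mount -t tmpfs -o size="$(tmpfs_size_mb < /proc/meminfo)" none /tmp/wtf`.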
Comment 26 Chris Wilson 2012-11-09 19:11:32 UTC
Peter, have you tested i915.i915_enable_rc6=0 (on top of the sledgehammer and w/a)? You have a most peculiar failure pattern where the GPU should be idle and then dies in a flush.
Comment 27 Daniel Vetter 2012-11-09 20:25:40 UTC
A similar bug on i965gm (bug #56916) mentions that things _only_ blow up when a mesa program is running. So everyone who can hit this, please reply with your exact mesa version and what (if any) GL programs you have running when this happens (GL compositor, ...). Also, those who can readily reproduce the hangs, please check whether stopping all GL clients (disable the compositor or use a non-GL one) prevents the hangs.
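
For reporting the Mesa version, one option is to read it from glxinfo output. The sketch below assumes the usual "OpenGL version string: ... Mesa <version>" line printed by glxinfo on Mesa drivers; `mesa_version` is a hypothetical name.

```shell
# Read glxinfo output on stdin and print the Mesa version taken from the
# "OpenGL version string:" line (nothing is printed if no such line exists).
# mesa_version is a hypothetical helper name.
mesa_version() {
    sed -n 's/^OpenGL version string: .*Mesa \([^ ]*\).*$/\1/p'
}
```

Typical usage would be `glxinfo | mesa_version`. The distribution package version remains a reasonable cross-check.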
Comment 28 Peter Wu 2012-11-09 22:26:22 UTC
New results:
- with rc6 disabled, neither 3.7-rc4 nor wa+sledgehammer+ringflush exposes the bug
- with rc6 not disabled (i.e. the default, -1), wa+sledgehammer+ringflush and GL compositing disabled in KWin, the bug is not triggered. (In the same boot, GL compositing was enabled again and the bug showed up.)

I am using the standard Mesa packages shipped with Arch Linux, that is 9.0.
The bug is triggered when KDE's KWin is active and glxgears is running. (instead of glxgears, I first tried glxspheres which triggers the bug too)
Comment 29 Chris Wilson 2012-11-10 08:19:21 UTC
Definitely looks like we have a pair of independent unresolved "cpu-relocs" and rc6 issues.
Comment 30 Daniel Vetter 2012-11-10 13:50:37 UTC
Ok, it looks like we have different bugs here, or at least non-overlapping sets of workarounds :(

Peter Wu, can you please check what happens when you manually enable rc6 on a 3.6 kernel?

Norbert Preining, test-results for your machine wrt rc6 vs. "mesa client/compositor running" vs. 3.6/3.7-rc would be really interesting, since iirc you can blow up your machine rather quickly, too.
Comment 31 Peter Wu 2012-11-11 10:42:09 UTC
3.6.6 w/o patches, w/ i915.i915_enable_rc6=1, w/ OpenGL compositing WM (KWin) and glxgears does *not* trigger the bug. I do get a very sluggish desktop which ultimately leads to some OOMs, but that is normal.

If it helps, I have tested the stock arch kernel config: https://projects.archlinux.org/svntogit/packages.git/tree/trunk/config.x86_64?h=packages/linux&id=89de8dc7df6894c219e746326ca338e9279c2e3f

and my own config: https://github.com/Lekensteyn/aur/blob/13feda6a55fb67c912c0611dc0c019bb084e7560/linux-custom/config
Comment 32 Norbert Preining 2012-11-11 11:50:03 UTC
(In reply to comment #30)
> Norbert Preining, test-results for your machine wrt rc6 vs. "mesa
> client/compositor running" vs. 3.6/3.7-rc would be really interesting, since
> iirc you can blow up your machine rather quickly, too.

I am running now with rc6 disabled and all the patches mentioned above. I am trying with Gnome3 and some GLX programs to see what I can do.

Norbert
Comment 33 Norbert Preining 2012-11-11 13:41:06 UTC
Created attachment 69900 [details]
i915 error state, i915_enable_rc6=0, rc4 + ilk-wa-pipe + sledgehammer + ring flush

As requested, here is another hang with rc6 disabled and the above patches.
Happened again when doing heavy photo viewing with quick switching in shotwell.

If you need other configurations or tests, please let me know

Norbert
Comment 34 Daniel Vetter 2012-11-11 13:51:30 UTC
(In reply to comment #33)
> Created attachment 69900 [details]
> i915 error state, i915_enable_rc6=0, rc4 + ilk-wa-pipe + sledgehammer + ring
> flush

Same hang as before on your machine between a rectlist PRIM and a BLT.

> As requested, here is another hang with rc6 disabled and the above patches.
> Happened again when doing heavy photo viewing with quick switching in
> shotwell.

To check: Is this with a GL client/compositor running?

> If you need other configurations or tests, please let me know

If the above is with a GL client, then trying to hang the box without any GL client/compositor running would be interesting.
Comment 35 Norbert Preining 2012-11-11 13:55:05 UTC
(In reply to comment #34)
> Same hang as before on your machine between a rectlist PRIM and a BLT.

Ok, at least repeatable ;-) So in my case rc6 does not make a difference; the former hang was without any specific rc6 cmdline.

> To check: Is this with a GL client/compositor running?

Gnome3, so I guess there is a compositor running. 

> If the above is with a GL client, then trying to hang the box without any GL
> client/compositor running would be interesting.

Hmm, what WM could I use, guess I have to try fvwm back again. Will try in one way or the other.

Norbert
Comment 36 Chris Wilson 2012-11-13 16:48:13 UTC
Norbert, since you see a slightly different presentation of this bug, it would be useful if you could also test http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=fastboot which despite its name also contains some work on the mb() around the relocations.
Comment 37 Norbert Preining 2012-11-13 22:32:38 UTC
(In reply to comment #36)
> http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=fastboot which despite
> its name also contains some work on the mb() around the relocations.

Ok, there is one merge conflict with current kernel master, but I am trying to build the kernel now after fixing the conflict in one way (keeping the code).

I tried also to merge that with the ilk-pile but that was hopeless with loads of merge conflicts.

Will give feedback as soon as I can.

Norbert
Comment 38 Daniel Vetter 2012-11-14 09:30:11 UTC
Our QA discovered a random corruption issue (bug #56859) and bisected it to

commit 7f1290f2f2a4d2c3f1b7ce8e87256e052ca23125
Author: Jianguo Wu <wujianguo@huawei.com>
Date:   Mon Oct 8 16:33:06 2012 -0700

    mm: fix-up zone present pages

Can those who can reproduce this bug here easily please test whether reverting that commit changes anything?
Comment 39 Peter Wu 2012-11-14 18:55:16 UTC
Reverting that commit on top of 3.7-rc5-git-14-g9924a19 does not help.
Comment 40 Chris Wilson 2012-11-14 19:04:43 UTC
(just restoring priority so it doesn't fall out of our p1 lists)
Comment 41 Chris Wilson 2012-11-15 11:38:49 UTC
So one thing worth trying is:

diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
index 7ad6d13..6177daa 100644
--- a/drivers/char/agp/intel-gtt.c
+++ b/drivers/char/agp/intel-gtt.c
@@ -573,7 +573,7 @@ static int intel_gtt_init(void)
                return ret;
 
        intel_private.base.gtt_mappable_entries = intel_gtt_mappable_entries();
-       intel_private.base.gtt_total_entries = intel_gtt_total_entries();
+       intel_private.base.gtt_total_entries = intel_gtt_mappable_entries();
 
        /* save the PGETBL reg for resume */
        intel_private.PGETBL_save =

(It's a bit shotgun, but if it still continues to fail after that all the earlier symptoms have just been canaries.)
Comment 42 Daniel Vetter 2012-11-15 12:16:27 UTC
(In reply to comment #41)
> So one thing worth trying is:
> 
> diff --git a/drivers/char/agp/intel-gtt.c b/drivers/char/agp/intel-gtt.c
> index 7ad6d13..6177daa 100644
> --- a/drivers/char/agp/intel-gtt.c
> +++ b/drivers/char/agp/intel-gtt.c
> @@ -573,7 +573,7 @@ static int intel_gtt_init(void)
>                 return ret;
>  
>         intel_private.base.gtt_mappable_entries =
> intel_gtt_mappable_entries();
> -       intel_private.base.gtt_total_entries = intel_gtt_total_entries();
> +       intel_private.base.gtt_total_entries = intel_gtt_mappable_entries();
>  
>         /* save the PGETBL reg for resume */
>         intel_private.PGETBL_save =
> 
> (It's a bit shotgun, but if it still continues to fail after that all the
> earlier symptoms have just been canaries.)

Looks eerily similar to attachment #69604 [details] [review] i.e. has been tried, doesn't work on at least Peter's machine.
Comment 43 Daniel Vetter 2012-11-15 13:15:37 UTC
Created attachment 70111 [details] [review]
disable unbound tracking

Silly me just noticed that the unbound tracking has been merged into 3.7, not 3.6. This has a big enough impact to explain all kinds of things. Please try the attached patch, thanks.
Comment 44 Chris Wilson 2012-11-15 13:30:28 UTC
(In reply to comment #43)
> Created attachment 70111 [details] [review] [review]
> disable unbound tracking
> 
> Silly me just noticed that the unbound tracking has been merged into 3.7,
> not 3.6. This has a big enough impact to explain all kinds of things. Please
> try the attached patch, thanks.

i.e. i915_gem_object_set_to_cpu_domain(obj, true); on unbind which would more explicitly test the failure mechanism.
Comment 45 Chris Wilson 2012-11-15 13:45:15 UTC
Created attachment 70114 [details] [review]
always do set-gtt-domain

As a follow-on test, this covers one of the areas where we short-circuit domain tracking, which may be fouled up by not calling set-to-cpu-domain upon unbind.
Comment 46 Chris Wilson 2012-11-15 21:13:45 UTC
Created attachment 70142 [details] [review]
always do set-gtt-domain

Better patch, maybe a fix for something...
Comment 47 Norbert Preining 2012-11-15 22:29:52 UTC
(In reply to comment #46)
> Created attachment 70142 [details] [review] [review]
> always do set-gtt-domain
> 
> Better patch, maybe a fix for something...

On top of what should we try that? rc5 plain? rc5+ilk-pile? ...?
Only this patch or some others from this thread, too?

Thanks

Norbert
Comment 48 Daniel Vetter 2012-11-15 22:32:37 UTC
(In reply to comment #47)
> (In reply to comment #46)
> > Created attachment 70142 [details] [review] [review] [review]
> > always do set-gtt-domain
> > 
> > Better patch, maybe a fix for something...
> 
> On top of what should we try that? rc5 plain? rc5+ilk-pile? ...?
> Only this patch or some others from this thread, too?

Plain 3.7-rc kernel, just pick one that's broken ;-) Please also test the patch in comment #43 since that one tests a different theory.
Comment 49 Norbert Preining 2012-11-15 22:39:23 UTC
(In reply to comment #48)
> Plain 3.7-rc kernel, just pick one that's broken ;-) Please also test the
> patch in comment #43 since that one tests a different theory.

Ok, compiling now. Thanks
Comment 50 Chris Wilson 2012-11-15 22:51:05 UTC
(In reply to comment #48)
> Plain 3.7-rc kernel, just pick one that's broken ;-) Please also test the
> patch in comment #43 since that one tests a different theory.

#44 / #46 are both elements of #43... All 3 are worth testing independently.
Comment 51 Peter Wu 2012-11-15 23:26:54 UTC
Still broken with the same dmesg messages:

- 3.7.0-rc5-git-68-gc5e35d6 + disable unbound tracking
- 3.7.0-rc5-git-68-gc5e35d6 + always-do-set-to-gtt

Do you want me to add the i915_error_states?
Comment 52 Daniel Vetter 2012-11-16 14:57:33 UTC
Peter Wu, since you seem to have dug out the only bisect result (which didn't check out when reverting), can you please check whether the parent of the bad commit is working out for you perfectly well? Afaict this should be

commit 0327d6ba998ca181013a5a1709701a6532a41972
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Aug 11 15:41:06 2012 +0100

    drm/i915: Extract general object init routine

Pretty much all hairy changes in gem for 3.7 are before that commit, so knowing that things are solid with this sha1 would be rather helpful. So please beat on this extensively, thanks.
Comment 53 Daniel Vetter 2012-11-16 18:24:03 UTC
Created attachment 70168 [details] [review]
disable cpu relocs completely

I'm not completely sure, but I think we haven't ruled this one out yet. Please test, thanks.
Comment 54 Peter Wu 2012-11-16 18:58:05 UTC
Still affected:
- 3.6.0-rc2-git-87-g0327d6b no patches
- 3.7.0-rc5-git-68-gc5e35d6 + disable-cpu-relocs

Do I need to combine some patches? E.g. the disable-cpu-relocs with sledgehammer + ring flush?
Comment 55 Daniel Vetter 2012-11-17 10:55:14 UTC
To hunt down a few other theories, can everyone please attach the complete dmesg (doesn't really matter whether with drm.debug or not, kernel version also doesn't matter).
Comment 56 Imre Deak 2012-11-17 14:30:34 UTC
Created attachment 70188 [details]
dmesg/error_state on ilk/drm-intel-nightly/ubuntu 12.10
Comment 57 Peter Wu 2012-11-17 17:14:46 UTC
Created attachment 70192 [details]
dmesg from 3.6.0-rc2-git-87-g0327d6b no patches
Comment 58 Norbert Preining 2012-11-18 12:20:58 UTC
Created attachment 70214 [details]
dmesg from 3.7.0-rc5+

Here my dmesg from current running kernel.
I just returned from a trip and will try the patch from comment 53 on top of the patches in comments 43 and 46.

Norbert
Comment 59 Daniel Vetter 2012-11-19 09:52:29 UTC
Created attachment 70248 [details] [review]
use dma32 for gem bo allocations

It's not very likely, but on the off chance that this helps, please test.
Comment 60 Daniel Vetter 2012-11-19 16:29:01 UTC
Ok, yet another new theory ... everyone please attach your kernel .config, thanks.
Comment 61 Peter Wu 2012-11-19 18:33:53 UTC
Created attachment 70271 [details]
kernel config used for 3.7.x

I haven't tested the patch from comment 59 yet, but here is my kernel config.
Comment 62 Imre Deak 2012-11-19 21:07:24 UTC
Created attachment 70279 [details]
config-3.7.0-rc4-g6283022
Comment 63 Norbert Preining 2012-11-19 22:21:16 UTC
Created attachment 70286 [details]
Norbert's kernel config
Comment 64 Norbert Preining 2012-11-20 07:39:30 UTC
Created attachment 70296 [details]
bug hit with patches  3.7-rc6 plus patches from #43 and #46, but not #59

Here is another hang when running the two patches 43 and 46. Immediately after that I also got a page allocation failure; here are the syslog messages:
Nov 20 14:32:57 tofuschnitzel kernel: [55009.700562] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Nov 20 14:32:57 tofuschnitzel kernel: [55009.700571] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Nov 20 14:32:58 tofuschnitzel kernel: [55011.204741] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Nov 20 14:32:58 tofuschnitzel kernel: [55011.204853] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
Nov 20 14:32:58 tofuschnitzel kernel: [55011.204858] [drm:i915_reset] *ERROR* Failed to reset chip.
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390112] cat: page allocation failure: order:9, mode:0x2000d0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390158] Pid: 17244, comm: cat Not tainted 3.7.0-rc6+ #42
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390188] Call Trace:
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390214]  [<ffffffff81095abc>] warn_alloc_failed+0x10a/0x11e
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390247]  [<ffffffff81096f44>] ? page_alloc_cpu_notify+0x3e/0x3e
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390280]  [<ffffffff81096f55>] ? drain_local_pages+0x11/0x13
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390313]  [<ffffffff81097de2>] __alloc_pages_nodemask+0x5a0/0x5e2
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390351]  [<ffffffff810beaf0>] ____cache_alloc+0x2b5/0x544
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390382]  [<ffffffff810beddd>] __kmalloc+0x5e/0x96
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390413]  [<ffffffff810ddeba>] seq_read+0x1c3/0x324
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390443]  [<ffffffff810c4d6a>] vfs_read+0x98/0xfa
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390470]  [<ffffffff810c4e19>] sys_read+0x4d/0x7a
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390500]  [<ffffffff814bb9d2>] system_call_fastpath+0x16/0x1b
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390532] Mem-Info:
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390547] DMA per-cpu:
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390565] CPU    0: hi:    0, btch:   1 usd:   0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390591] CPU    1: hi:    0, btch:   1 usd:   0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390617] CPU    2: hi:    0, btch:   1 usd:   0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390643] CPU    3: hi:    0, btch:   1 usd:   0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390669] DMA32 per-cpu:
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390686] CPU    0: hi:  186, btch:  31 usd:   0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.390712] CPU    1: hi:  186, btch:  31 usd: 185
Nov 20 14:33:58 tofuschnitzel kernel: [55071.391920] CPU    2: hi:  186, btch:  31 usd:   0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.393113] CPU    3: hi:  186, btch:  31 usd:   0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.394320] Normal per-cpu:
Nov 20 14:33:58 tofuschnitzel kernel: [55071.395487] CPU    0: hi:  186, btch:  31 usd:   0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.396654] CPU    1: hi:  186, btch:  31 usd: 152
Nov 20 14:33:58 tofuschnitzel kernel: [55071.397839] CPU    2: hi:  186, btch:  31 usd:  28
Nov 20 14:33:58 tofuschnitzel kernel: [55071.398968] CPU    3: hi:  186, btch:  31 usd:   0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.400072] active_anon:115042 inactive_anon:65045 isolated_anon:0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.400072]  active_file:264259 inactive_file:405769 isolated_file:0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.400072]  unevictable:22 dirty:7 writeback:0 unstable:0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.400072]  free:58783 slab_reclaimable:45728 slab_unreclaimable:11528
Nov 20 14:33:58 tofuschnitzel kernel: [55071.400072]  mapped:15992 shmem:10670 pagetables:6769 bounce:0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.400072]  free_cma:0
Nov 20 14:33:58 tofuschnitzel kernel: [55071.406629] DMA free:15748kB min:540kB low:672kB high:808kB active_anon:0kB inactive_anon:4kB active_file:120kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:20kB slab_unreclaimable:4kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Nov 20 14:33:58 tofuschnitzel kernel: [55071.410109] lowmem_reserve[]: 0 2925 3808 3808
Nov 20 14:33:58 tofuschnitzel kernel: [55071.411302] DMA32 free:172664kB min:103388kB low:129232kB high:155080kB active_anon:357940kB inactive_anon:99880kB active_file:879092kB inactive_file:1375920kB unevictable:16kB isolated(anon):0kB isolated(file):0kB present:2995364kB mlocked:16kB dirty:12kB writeback:0kB mapped:46008kB shmem:12988kB slab_reclaimable:140052kB slab_unreclaimable:11464kB kernel_stack:584kB pagetables:5324kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:18 all_unreclaimable? no
Nov 20 14:33:59 tofuschnitzel kernel: [55071.415174] lowmem_reserve[]: 0 0 883 883
Nov 20 14:33:59 tofuschnitzel kernel: [55071.416495] Normal free:46720kB min:31236kB low:39044kB high:46852kB active_anon:102228kB inactive_anon:160296kB active_file:177824kB inactive_file:247156kB unevictable:72kB isolated(anon):0kB isolated(file):0kB present:904960kB mlocked:72kB dirty:16kB writeback:0kB mapped:17960kB shmem:29692kB slab_reclaimable:42840kB slab_unreclaimable:34644kB kernel_stack:2592kB pagetables:21752kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Nov 20 14:33:59 tofuschnitzel kernel: [55071.420657] lowmem_reserve[]: 0 0 0 0
Nov 20 14:33:59 tofuschnitzel kernel: [55071.422091] DMA: 3*4kB 3*8kB 0*16kB 1*32kB 3*64kB 3*128kB 3*256kB 2*512kB 3*1024kB 3*2048kB 1*4096kB = 15748kB
Nov 20 14:33:59 tofuschnitzel kernel: [55071.423560] DMA32: 2116*4kB 2105*8kB 5306*16kB 1052*32kB 192*64kB 39*128kB 13*256kB 10*512kB 1*1024kB 1*2048kB 0*4096kB = 172664kB
Nov 20 14:33:59 tofuschnitzel kernel: [55071.425055] Normal: 2764*4kB 1152*8kB 951*16kB 237*32kB 27*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 48768kB
Nov 20 14:33:59 tofuschnitzel kernel: [55071.426553] 695040 total pagecache pages
Nov 20 14:33:59 tofuschnitzel kernel: [55071.428009] 14351 pages in swap cache
Nov 20 14:33:59 tofuschnitzel kernel: [55071.429508] Swap cache stats: add 236320, delete 221969, find 63452/74380
Nov 20 14:33:59 tofuschnitzel kernel: [55071.431003] Free swap  = 9658608kB
Nov 20 14:33:59 tofuschnitzel kernel: [55071.432418] Total swap = 9905148kB
Nov 20 14:33:59 tofuschnitzel kernel: [55071.446640] 1015792 pages RAM
Nov 20 14:33:59 tofuschnitzel kernel: [55071.448108] 37280 pages reserved
Nov 20 14:33:59 tofuschnitzel kernel: [55071.449587] 1201194 pages shared
Nov 20 14:33:59 tofuschnitzel kernel: [55071.451024] 804246 pages non-shared
Nov 20 14:33:59 tofuschnitzel kernel: [55071.452459] SLAB: Unable to allocate memory on node 0 (gfp=0xd0)
Nov 20 14:33:59 tofuschnitzel kernel: [55071.454003]   cache: size-2097152, object size: 2097152, order: 9
Nov 20 14:33:59 tofuschnitzel kernel: [55071.455474]   node 0: slabs: 0/0, objs: 0/0, free: 0

Maybe that helps. Now I am running #43, #46, #59.
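
For anyone reading the allocation failure above: the request is order 9, which with 4 kB pages (an assumption, though it is the usual x86 value) comes to 2048 kB, matching the size-2097152 cache named in the SLAB message. The sketch below does that arithmetic and counts how many free blocks of at least that size the DMA32 buddy line reports. The variable names are made up for illustration, and a block being listed as free does not guarantee the allocator will hand it out, since watermark checks may still apply.

```shell
# Size of an order-N allocation, assuming 4 kB pages.
order=9
page_kb=4
alloc_kb=$(( (1 << order) * page_kb ))
echo "order $order = ${alloc_kb} kB"

# One buddy-allocator line copied verbatim from the log above.
line='DMA32: 2116*4kB 2105*8kB 5306*16kB 1052*32kB 192*64kB 39*128kB 13*256kB 10*512kB 1*1024kB 1*2048kB 0*4096kB = 172664kB'

# Sum the counts of all free blocks that are at least alloc_kb in size.
big_blocks=$(printf '%s\n' "$line" | awk -v need="$alloc_kb" '{
    n = 0
    for (i = 2; i <= NF; i++) {
        if (split($i, p, "*") == 2) {   # only fields shaped like COUNT*SIZEkB
            size = p[2] + 0             # "2048kB" converts to 2048
            if (size >= need) n += p[1]
        }
    }
    print n
}')
echo "free blocks >= ${alloc_kb} kB in DMA32: $big_blocks"
```

The call trace shows the request coming from seq_read on behalf of cat, so the failure was hit while reading a file, after the GPU had already been declared wedged.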
Comment 65 Norbert Preining 2012-11-20 08:44:48 UTC
Another hang with #43, #46, #59 patches. Is the i915 error state needed?

It always happens while I am doing heavy IO things. This time pbuilder/cowbuilder installation tests of 1Gb of new packages on Debian.

Norbert
Comment 66 Chris Wilson 2012-11-20 08:48:45 UTC
One question we haven't asked is whether this is a genuine hang or an unfortunate hangcheck. Can you please reproduce with i915.enable_hangcheck=0 and see if your machine locks up instead of reporting the hang?
Comment 67 Imre Deak 2012-11-20 09:29:40 UTC
For me bisection pointed to commit #6c085a72 - drm/i915: Track unbound pages. A couple of additional test runs at this and its parent commit confirm this.

I'll try patches from comment #43-#46. Since those didn't fix the problem for Norbert it might be we have multiple issues.

Chris: running with enable_hangcheck=0 on #6c085a72 the machine locked up.
Comment 68 Chris Wilson 2012-11-21 15:30:39 UTC
Imre suggests that there is a possible fix in http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=for-imre. Can people please try that branch and see if it does improve matters for them?
Comment 69 Norbert Preining 2012-11-22 04:50:23 UTC
Hi Chris,

I am now running with  http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=for-imre pulled into main linux git.

The first thing I noticed is that there are some very strange effects happening: I have docky (a panel) running, and set to auto-hide. If it is shown, then there are boxes around *some* of the icons there. And if the docky panel is going into hiding mode, then a nice green bar appears across my screen.

Is this a known problem? Should I report it somewhere? I have made two screenshots showing the effects, should I upload them here or somewhere else?

Now I do some testing with these patches

Norbert
Comment 70 Chris Wilson 2012-11-22 08:25:58 UTC
(In reply to comment #69)
> Hi Chris,
> 
> I am now running with 
> http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=for-imre pulled into
> main linux git.
> 
> The first thing I realized that there are some very strange effects
> happening: I have docky (a panel) running, and set to auto-hide. If it is
> shown, then there are boxes around *some* of the icons there. And if the
> docky panel is going into hiding mode, then a nice green bar appears across
> my screen.
> 
> Is this a known problem? Should I report it somewhere? I have made two
> screenshots showing the effects, should I upload them here or somewhere else?

No, that's a little unexpected...But for now focus on the question whether the original hang is reproducible and I'll build a second tree with just the likely fixes.
Comment 71 Chris Wilson 2012-11-22 08:46:10 UTC
I've put a smaller selection of patches in http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug55984. It's still a shotgun approach, but a good first step will be to see if it cures the hang..
Comment 72 Mika Kuoppala 2012-11-22 13:04:14 UTC
Created attachment 70428 [details]
Script to trigger the bug.

This seems to be the quickest way to repro the bug on Ville's ILK
Comment 73 Norbert Preining 2012-11-23 04:51:41 UTC
(In reply to comment #72)
> Script to trigger the bug.
> 
> This seems to be quickest way to repro the bug on Ville's ILK

I couldn't trigger anything with that; it just happily continued for ages. On my machine heavy IO with both reads and writes seems to be necessary, while the script only reads.
Comment 74 Norbert Preining 2012-11-23 09:48:26 UTC
Running current linux HEAD with http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug55984 pulled in I cannot trigger the crash, however hard I have tried for some time now.

Compared with the for-imre branch also the strange artifacts are gone, so it looks much better now.

I have still the following boot cmd line:
   i915.i915_enable_rc6=0 i915.enable_hangcheck=0

I will do more testing, of course

Norbert
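
To double-check that both options from the command line above actually took effect, something like the following sketch could be used. It assumes the 3.7-era parameter names quoted above and that the driver exposes them under /sys/module/i915/parameters; the function name is made up for illustration, and the rc6 parameter may only be readable as root.

```shell
# Print the two i915 parameters discussed above.
# The optional argument overrides the sysfs directory (handy for dry runs).
show_i915_params() {
    dir="${1:-/sys/module/i915/parameters}"
    for p in i915_enable_rc6 enable_hangcheck; do
        if [ -r "$dir/$p" ]; then
            echo "$p=$(cat "$dir/$p")"
        else
            # Module not loaded, parameter not exported, or insufficient rights.
            echo "$p=unreadable"
        fi
    done
}

show_i915_params
```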
Comment 75 Norbert Preining 2012-11-24 00:08:02 UTC
Okay, I guess I have to retract my statement; I didn't realize it at first. Due to i915.enable_hangcheck=0 it seems that not just 3D and the Gnome3 WM died, but the screen went black and didn't react to anything. Interestingly there were no messages at all in the log files.

I could SysRq the computer, and the logfiles showed activity, but the screen remained black.

I guess that is the sign of a hang.

Pity.
Comment 76 Chris Wilson 2012-11-24 19:47:56 UTC
Ok, after banging my head against this for several days, I have decreed that the death within the render ring (inside the sequence of FLUSH PIPE_CONTROLx8 FLUSH) is due to enabling of rc6 on ILK. That doesn't explain all the crashes, but it does explain the "immediate" crashes on danvet-ilk using Daniel's killscript.

The remaining crashes, where the GPU vanishes in mid-batch, are what we need to try and reproduce - and I hope the common factor between this bug and #57122, #56916, #57136.
Comment 77 Peter Wu 2012-11-25 13:55:14 UTC
With http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug55984, the situation does not change, i.e. still the lockup message and vanishing 3D capabilities.

(this bug is still marked NEEDINFO, do you need more details?)
Comment 78 Chris Wilson 2012-11-25 19:06:10 UTC
The last bit of information we need is how to reproduce the non-rc6 related hangs - all the killscripts we've generated so far seem to hit the rc6 issue, afaik.

(The highest priority is used for internal bug tracking.)
Comment 79 Chris Wilson 2012-11-26 09:53:37 UTC
Created attachment 70577 [details] [review]
Don't force GTT/CPU relocations

Today's patch, please disable rc6 whilst testing.
Comment 80 Norbert Preining 2012-11-26 12:35:23 UTC
Chris, please let us know on top of what? On top of the bug55984 git branch you created earlier, or is that one not necessary?

Thanks
Comment 81 Chris Wilson 2012-11-26 12:45:59 UTC
(In reply to comment #80)
> Chris, please let us know on top of what? On top of the bug55984 git branch
> you created earlier, or is that one not necessary?

In isolation, so on top of 3.7-rc7 or drm-intel-fixes. The goal is to both understand the issue and develop a minimal patch in time for 3.7. So yesterday...
Comment 82 Chris Wilson 2012-11-29 08:55:31 UTC
No news... Has the instadeath gone, but the slow lingering death remains?
Comment 83 Norbert Preining 2012-11-29 11:09:02 UTC
Hi Chris,

> No news... Has the instadeath gone, but the slow lingering death remains?

Well, I am running the latest patch on top of git and trying to trigger the bug, so far without success.

I cannot say more than that; at least the frequency has gone down.

Let me know if I can help more than just trying to trigger it again.

Norbert
Comment 84 Peter Wu 2012-11-29 14:53:14 UTC
With i915.i915_enable_rc6=0 I am unable to trigger the bug. With the patch applied on top of 3.7-rc7, the bug is still not exposed (as expected).
Comment 85 Norbert Preining 2012-12-01 05:52:54 UTC
Created attachment 70855 [details]
i915 error state, rc6=0, patch from comment 79

Hi Chris,

sorry to say it, but I got a hang today. In the background some update.mlocate etc was running, plus some git checkout of a big repository.

i915 error state uploaded.

Norbert
Comment 86 Chris Wilson 2012-12-01 09:37:54 UTC
Norbert, if you can reproduce it with SNA, due to the packing of the batchbuffer I can have a much better idea of what is going on. The suspicion is definitely some dodgy state in the surface packet - but at this moment in time, it could even be an alignment issue that's been hidden by buffer layout. :|
Comment 87 Norbert Preining 2012-12-02 22:20:47 UTC
Hi Chris,

ok, will switch to SNA, although in one of the email threads predating this bug report I switched to SNA and then was told not to.

Anyway, trying to recreate it with SNA.

Norbert
Comment 88 Chris Wilson 2012-12-02 22:29:39 UTC
(In reply to comment #87)
> Hi Chris,
> 
> ok, will switch to SNA, although in one of the email threads predating this
> bug report I switched to SNA and then was told not to.

Dave didn't want to muddy the waters and wanted to make sure we understand the root cause of the regression with UXA; I'm trying to use SNA as a diagnostic.
Comment 89 Norbert Preining 2012-12-04 23:45:14 UTC
First report on SNA: I got a very very strange thing yesterday. After resuming from suspend-to-ram I continued working with GIMP on some graphics, and first it started with some blue flashes on the screen, and finally the screen got completely blue, but I could in principle still interact with the windows, just without seeing anything.

Restarting X (gdm) did fix it for me.

Doing more heavy IO to stress test SNA.

Norbert
Comment 90 Chris Wilson 2012-12-05 15:18:10 UTC
(In reply to comment #89)
> First report on SNA: I got a very very strange thing yesterday. After
> resuming from suspend-to-ram I continued working with GIMP on some graphics,
> and first it started with some blue flashes on the screen, and finally the
> screen got completely blue, but I could in principle still interact with the
> windows, just without seeing anything.

Interesting, certainly the first I've heard of such. Can you see if it is possible to capture it in a screenshot, or failing that a photograph, and please open a bug report for it.
Comment 91 Justin P. Mattock 2012-12-05 19:00:00 UTC
this bug is similar or the same as this..:
https://bugzilla.kernel.org/show_bug.cgi?id=49571

my bisect is pointing to the changes with device_cgroup.c

[66b8ef67756b3051bf42a077a82c3c5c279caa5b] device_cgroup: add "deny_all" in
dev_cgroup structure

I have two more revisions to test, but am sure they won't matter since they are for fat and kernel-doc
once I am done bisecting I will pull to the current Mainline and run it to see if the fixes in device_cgroup fix this for me, then will go from there.
Comment 92 Norbert Preining 2012-12-13 00:31:56 UTC
Hi Chris,

(In reply to comment #90)
> (In reply to comment #89)
> > First report on SNA: I got a very very strange thing yesterday. After
> > resuming from suspend-to-ram I continued working with GIMP on some graphics,
> > and first it started with some blue flashes on the screen, and finally the
> > screen got completely blue, but I could in principle still interact with the
> > windows, just without seeing anything.
> 
> Interesting, certainly the first I've heard of such. Can you see if it is
> possible to capture it in a screenshot, or failing that a photograph, and
> please open a bug report for it.

sorry for the late reply, I was on a trip.

Since it only happened once and on some rc kernels I leave it for now, but a screenshot is not necessary, it was simply monocolor blue ... nothing else ;-)

Another thing: I have now tried to hit the bug with SNA for a long time, without *any* success. As soon as I switched back to UXA the bug was triggered within a short time.

Does this help you?

Norbert
Comment 93 Chris Wilson 2012-12-13 09:51:30 UTC
(In reply to comment #92) 
> Another thing: I have now tried to hit the bug with SNA for a long time,
> without *any* success. As soon as I switched back to UXA the bug was
> triggered within a short time.
> 
> Does this help you?

As a null data-point, yes. My interpretation is then back towards a relocation/mm issue. (There is an outside chance that some of the surface alignment tweaks help, for which getting the SURFACE_STATE would have been useful...)

Norbert, a quick scan of the bug report doesn't yield any information as to whether you tested the mb() theory. Can you please try:

http://cgit.freedesktop.org/~ickle/linux-2.6 #master

Just compile the master branch (at 3.7-rc4).
Comment 94 Chris Wilson 2012-12-13 10:37:15 UTC
Created attachment 71437 [details] [review]
Keep reserved objects pinned until after relocation processing.

An idea at last: earlier objects are moved in order to perform the relocations. This should be impossible as we try to detect when we are going to require GTT access for relocation processing and reserve it in the right spot. However, the code change is minor and it should be easy enough to test...
Comment 95 Chris Wilson 2012-12-13 10:54:25 UTC
Created attachment 71439 [details] [review]
Keep reserved objects pinned until after relocation processing.
Comment 96 Brad Jackson 2012-12-13 16:23:05 UTC
The "Keep reserved objects pinned" patch on 3.7.0 does not fix the hang for me. The only thing I've seen so far that stops it is the last kernel I compiled for my bisect to bad commit 6c085a728cf000ac1865d66f8c9b52935558b328. I believe a couple others have also bisected to the same result.
Comment 97 Norbert Preining 2012-12-13 23:08:01 UTC
(In reply to comment #93)
> Norbert, a quick scan of the bug report doesn't yield any information as to
> whether you tested the mb() theory. Can you please try:
> 
> http://cgit.freedesktop.org/~ickle/linux-2.6 #master

Comment 74 and 75 seem to indicate that I tried at least a subset.
Do you want me to try it with the full branch?

Norbert
Comment 98 Chris Wilson 2012-12-13 23:11:33 UTC
(In reply to comment #97)
> (In reply to comment #93)
> > Norbert, a quick scan of the bug report doesn't yield any information as to
> > whether you tested the mb() theory. Can you please try:
> > 
> > http://cgit.freedesktop.org/~ickle/linux-2.6 #master
> 
> Comment 74 and 75 seem to indicate that I tried at least a subset.
> Do you want me to try it with the full branch?

Yes, they were testing for specific ideas. I'm back to trying a shotgun approach.
Comment 99 Chris Wilson 2012-12-14 00:24:30 UTC
(In reply to comment #96)
> The "Keep reserved objects pinned" patch on 3.7.0 does not fix the hang for
> me. The only thing I've seen so far that stops it is the last kernel I
> compiled for my bisect to bad commit
> 6c085a728cf000ac1865d66f8c9b52935558b328. I believe a couple others have
> also bisected to the same result.

Brad, can you please attach an i915_error_state from your hang and Xorg.log?
Comment 100 Brad Jackson 2012-12-14 20:31:29 UTC
Created attachment 71521 [details]
i915_error_state and Xorg log for 3.7.0+keep reserved pinned patch

Requested files
Comment 101 Chris Wilson 2012-12-14 21:00:04 UTC
(In reply to comment #100)
> Created attachment 71521 [details]
> i915_error_state and Xorg log for 3.7.0+keep reserved pinned patch

rc6 is disabled on Ironlake precisely because it is causing the lockup you are encountering. We already know that we are missing workarounds for enabling rc6 on Ironlake.
Comment 102 Florian Mickler 2012-12-15 03:31:48 UTC
A patch referencing this bug report has been merged in Linux v3.7-rc8:

commit 6567d748c4e94e3481e523803ec07ebd825c80d6
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Nov 10 10:00:06 2012 +0000

    Revert "drm/i915: enable rc6 on ilk again"
Comment 103 Imre Deak 2012-12-17 13:28:33 UTC
Created attachment 71651 [details]
i915_error_state.txt.gz

An easily reproducible hang, though it might be unrelated to the one we are after here.
Comment 104 Chris Wilson 2012-12-17 13:43:25 UTC
(In reply to comment #103)
> Created attachment 71651 [details]
> i915_error_state.txt.gz
> 
> An easily reproducible hung, though might be unrelated to the one we are
> after here.

That's a spectacularly broken mesa batch buffer. Its surface state base doesn't point anywhere near a buffer.
Comment 105 Daniel Vetter 2012-12-18 10:56:36 UTC
Please try out the patch at

https://patchwork.kernel.org/patch/1885411/

It has a decent chance to reduce gtt thrashing, which might be good enough to again duct-tape over the hangs. Or maybe change the pattern to be able to reproduce it much quicker. In any case, should be interesting ...
Comment 106 Brad Jackson 2012-12-18 17:42:12 UTC
I got the gpu hung error after 45 minutes on 3.7.1 with rc6=0 and the evict blocks patch applied.
Comment 107 Daniel Vetter 2012-12-19 13:40:33 UTC
Created attachment 71805 [details] [review]
make the shrinker less aggressive

Duct-tape solution if it is one, but imo very much worth a try.
Comment 108 Chris Wilson 2012-12-21 12:52:08 UTC
Created attachment 71926 [details] [review]
Mark unused portions of the GTT as invalid

Working on the theory that this is an invalid access beyond the end of a bo, this should hopefully enable GPU detection.
Comment 109 Chris Wilson 2012-12-21 13:52:04 UTC
Created attachment 71933 [details] [review]
Align surface sizes to an even tile row

And this is the complaint the GPU found. Please test this with your IO heavy workloads.
Comment 110 Norbert Preining 2012-12-22 07:28:20 UTC
Hi Daniel, hi Chris

(In reply to comment #107)
> Created attachment 71805 [details] [review]
> make the shrinker less aggressive
> 
> Duct-tape solution if it is one, but imo very much worth a try.

There have been a lot of patches floating around, but I was running 3.7.0 plus this patch now for a while, and using UXA (*not* using SNA, with SNA it always was fine). 

I did not hit any problem till now, although I did heavy IO stuff as usual (svn up and git svn rebase on two 6+Gb repos), etc.

Chris: What should I do next? You have posted two patches (108 and 109 comments), should I try both, or each one independently?

Or is the information that patch from 107 is fine (till now) enough?

Please let me know, and thanks for your work on that, and Merry Christmas!

Norbert
Comment 111 Chris Wilson 2012-12-22 10:14:34 UTC
(In reply to comment #110)
> Hi Daniel, hi Chris
> 
> (In reply to comment #107)
> > Created attachment 71805 [details] [review]
> > make the shrinker less aggressive
> > 
> > Duct-tape solution if it is one, but imo very much worth a try.
> 
> There have been a lot of patches floating around, but I was running 3.7.0
> plus this patch now for a while, and using UXA (*not* using SNA, with SNA it
> always was fine). 
> 
> I did not hit any problem till now, although I did heavy IO stuff as usual
> (svn up and git svn rebase on two 6+Gb repos), etc.

Meh. As that patch is basically changing the ordering of the objects considered for shrinking, it just opens a can of worms - but it does seem to be a stopgap workaround.

> Chris: What should I do next? You have posted two patches (108 and 109
> comments), should I try both, or each on independently?

#108 is a means of provoking the GPU to spot a lot more errors, it needs a little more refinement to not first fall over on standard UXA behaviour.

I'd be interested in seeing if #109 has any effect at all (on stock 3.7.0 + uxa).

There is also https://patchwork.kernel.org/patch/1896161/ that would be useful to test (as it fixes a real bug and it would be cool if it was also an effective workaround here).
 
> Or is the information that patch from 107 is fine (till now) enough?

It is merely the start. :-p
Comment 112 Norbert Preining 2012-12-22 10:45:17 UTC
Hi,

I will try the patch from 109 in combination with the patchwork patch next. I think I *did* try the patchwork test recently.

So I will go silent now and report back in a few days if no problems arose, or immediately if it freezes again.

Thanks

Norbert
Comment 113 Peter Wu 2012-12-22 12:03:38 UTC
[reply to comment 107]
I haven't tried it yet, do you still want me to test it?

[reply to comment 108]
I cannot apply it on top of 3.7.1. What base do you want me to test it on?

[reply to comment 109]
Do I need to apply this to xf86-video-intel? If yes, which version/commit?

[reply to comment 111]
I tried to apply https://patchwork.kernel.org/patch/1896161/ on top of vanilla v3.7.1, but I could not get it to compile:
drivers/gpu/drm/drm_mm.c: In function ‘drm_mm_scan_remove_block’:
drivers/gpu/drm/drm_mm.c:612:3: error: implicit declaration of function ‘__drm_mm_hole_node_end’ [-Werror=implicit-function-declaration]

Did you mean drm_mm_hole_node_end?
Comment 114 Norbert Preining 2012-12-22 12:21:22 UTC
(In reply to comment #113)
> [reply to comment 109]
> Do I need to apply this to xf86-video-intel? If yes, which version/commit?

Yes, it is xf86-video-intel. I am running it with the current version of Debian/sid + this patch now (2.20.14-1 + the patch)

> [reply to comment 111]
> I tried to apply https://patchwork.kernel.org/patch/1896161/ on top of
> vanilla v3.7.1, but I could not get it to compile:

Hmm, I have compiled it with 3.7.0 git kernel without a problem, some offset while patching but that was all.

Norbert
Comment 115 Peter Wu 2012-12-22 14:10:18 UTC
(In reply to comment #114)
> Yes, it is xf86-video-intel. I am running it with the current version of
> Debian/sid + this patch now (2.20.14-1 + the patch)
Is it a standalone patch or does it depend on the former DRM patch? Anyway, I have tested it with 3.7.1 + i915.i915_enable_rc6=1 and it still triggers the hang check thingey as before. (without enabling rc6 it does not do that)

> > [reply to comment 111]
> > I tried to apply https://patchwork.kernel.org/patch/1896161/ on top of
> > vanilla v3.7.1, but I could not get it to compile:
> 
> Hmm, I have compiled it with 3.7.0 git kernel without a problem, some offset
> while patching but that was all.
I got some offset issues too, but it failed to compile at all because the function was not defined. Did you really apply the patch to the right source tree?
Comment 116 Norbert Preining 2012-12-22 15:11:42 UTC
(In reply to comment #115)
> (In reply to comment #114)
> > Yes, it is xf86-video-intel. I am running it with the current version of
> > Debian/sid + this patch now (2.20.14-1 + the patch)
> Is it a standalone patch or does it depend on the former DRM patch? Anyway,
> I have tested it with 3.7.1 + i915.i915_enable_rc6=1 and it still triggers
> the hang check thingey as before. (without enabling rc6 it does not do that)

the rc6 needs to be disabled *in*any*case*, that is known by now.
And it is a standalone patch of xf86-video-intel. Did you recompile it?

> > > [reply to comment 111]
> > > I tried to apply https://patchwork.kernel.org/patch/1896161/ on top of
> > > vanilla v3.7.1, but I could not get it to compile:
> > 
> > Hmm, I have compiled it with 3.7.0 git kernel without a problem, some offset
> > while patching but that was all.
> I got some offset issues too, but it failed to compile at all because the
> function was not defined. Did you really apply the patch to the right source
> tree?

Huuu? Are you sure? I checked my git commit log and it is kernel 3.7 tag of Linus, then the merge into my git repo, and then the patch. Nothing else.

I guess you are running something else.

Norbert
Comment 117 Peter Wu 2012-12-22 16:16:05 UTC
(In reply to comment #116)
> the rc6 needs to be disabled *in*any*case*, that is known by now.
> And it is a standalone patch of xf86-video-intel. Did you recompile it?
With rc6 disabled I cannot trigger the bug. Yes, I recompiled and restarted X.

> > > > [reply to comment 111]
> > > > I tried to apply https://patchwork.kernel.org/patch/1896161/ on top of
> > > > vanilla v3.7.1, but I could not get it to compile:
> > > 
> > > Hmm, I have compiled it with 3.7.0 git kernel without a problem, some offset
> > > while patching but that was all.
> > I got some offset issues too, but it failed to compile at all because the
> > function was not defined. Did you really apply the patch to the right source
> > tree?
> 
> Huuu? Are you sure? I checked my git commit log and it is kernel 3.7 tag of
> Linus, then the merge into my git repo, and then the patch. Nothing else.
> 
> I guess you are running something else.

$ grep -rn __drm_mm_hole_node_end
(empty)
$ git log -S __drm_mm_hole_node_end
(empty)
$ git describe
v3.7.1

(using remote git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-stable.git, but git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git is in the same directory)

CONFIG_DRM=m (obvious..) and I cannot see any #ifdefs that exclude that file/function. Can you check your compile logs/flags to be sure?
Comment 118 Norbert Preining 2012-12-22 23:11:20 UTC
[/usr/src/git-kernel/linux-2.6] grep -rn drm_mm_hole_node_end
Binary file drivers/gpu/drm/drm.ko matches
...
drivers/gpu/drm/drm_mm.c:126:	unsigned long hole_end = drm_mm_hole_node_end(hole_node);
drivers/gpu/drm/drm_mm.c:210:	unsigned long hole_end = drm_mm_hole_node_end(hole_node);
...
Comment 119 Peter Wu 2012-12-22 23:19:16 UTC
(In reply to comment #118)
> [/usr/src/git-kernel/linux-2.6] grep -rn drm_mm_hole_node_end
> Binary file drivers/gpu/drm/drm.ko matches
> ...
I saw that, but I have a symbol with a __ prefix.

Looking at the comments, there are two versions of that patch: comment 105 and comment 111 (v3). I guess you applied the earlier one? Chris, can you comment on this?
Comment 120 Norbert Preining 2012-12-22 23:31:31 UTC
(In reply to comment #119)
> I saw that, but I have a symbol with a __ prefix.

Aren't these created by the compiler ???

> Looking at the comments, there are two versions of that patch: comment 105
> and comment 111 (v3). I guess you applied the earlier one? Chris, can you

Indeed ... indeed ... I don't know why, but checking the actual diff rather than the git log message I wrote, I see that it is the patchwork patch from comment 105 ... ok, trying the other one now...

Norbert
Comment 121 Norbert Preining 2012-12-22 23:39:12 UTC
(In reply to comment #120)
> > Looking at the comments, there are two versions of that patch: comment 105
> > and comment 111 (v3). I guess you applied the earlier one? Chris, can you
> 
> Indeed ... indeed ... I don't know why, but checking the actual diff rather
> than the git log message I wrote, I see that it is the patchwork patch from
> comment 105 ... ok, trying the other one now...

Indeed, patchwork 1896161 from comment 111 does not compile.

Norbert
Comment 122 Chris Wilson 2012-12-23 11:40:33 UTC
Created attachment 72022 [details] [review]
Only evict the blocks required to free the hole

(In reply to comment #121)
> Indeed, patchwork 1896161 from comment 111 does not compile.

It's just based on a slightly more recent tree. Meta-patch: s/__drm/drm/
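
The meta-patch can be applied mechanically to the patch file before it is applied to a v3.7 tree. A minimal sketch, assuming that only the `__drm_mm_hole_node_*` helpers (the symbol Peter grepped for) need renaming; the function name `rename_hole_helpers` and the file names in the usage line are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: read a patch on stdin and write it to stdout with the
# newer-tree helper names replaced by the names found in the v3.7 tree.
# Assumption: only the __drm_mm_hole_node_* helpers differ between the trees.
rename_hole_helpers() {
    sed 's/__drm_mm_hole_node_/drm_mm_hole_node_/g'
}
```

Possible usage: `rename_hole_helpers < patchwork-1896161.patch > patch-for-3.7.patch`. Context lines are rewritten as well, which is intended here because the older tree uses the unprefixed names throughout.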
Comment 123 Norbert Preining 2012-12-23 11:53:39 UTC
(In reply to comment #122)
> It's just based on a slightly more recent tree. Meta-patch: s/__drm/drm/

Thanks, rebooting now with new kernel (and still patched intel xf86 driver)
Comment 124 Peter Wu 2012-12-23 13:38:03 UTC
I am unable to reproduce any hang with rc6 disabled. Chris, should I test it with rc6 enabled?
Norbert, do you have a reliable test-case? My sw/hw details:

- Distro: Arch Linux x86_64 (with testing repos enabled)
- DDX: xf86-video-intel 2.20.16 on Xorg 1.13.1
- KDE 4.9.4 as desktop environment, KWin uses OpenGL compositing
- Kernel: 3.7.1 (config https://raw.github.com/Lekensteyn/aur/master/linux-custom/config + watchdog patch)
- SSD: Intel 320 80G (/boot + LUKS-encrypted filesystem)
- CPU: i5-460M
- RAM: 8G

My rc6-enabled hang could be triggered by copying a Linux source tree from the SSD to tmpfs a few times (without hitting OOM).
Comment 125 Chris Wilson 2012-12-23 14:22:07 UTC
(In reply to comment #124)
> I am unable to reproduce any hang with rc6 disabled. Chris, should I test it
> with rc6 enabled?

One bug at a time! ;-)

The rc6=0 bug seems much nastier, as it affects gen4/gen5 and seems to imply a deep underlying synchronisation issue (so it could actually affect all architectures and just be more prevalent on gen4/5). The rc6=1 ilk bug really does look like a missing hw workaround; dying between a series of flushes on the render ring is a good indication that our hw interaction is at fault. rc6 is disabled by default on ilk because we had not yet managed to make it work reliably; that it still doesn't work makes it a lower priority bug to chase. However, given the closely linked bisection, fixing the common problem may indeed make the rc6 bug harder to trigger.
Comment 126 Norbert Preining 2012-12-26 03:11:51 UTC
(In reply to comment #122)
> Created attachment 72022 [details] [review]
> Only evict the blocks required to free the hole
> 
> (In reply to comment #121)
> > Indeed, patchwork 1896161 from comment 111 does not compile.
> 
> It's just based on a slightly more recent tree. Meta-patch: s/__drm/drm/


Ok, it seems to be very stable now. I have been running for a few days with the patched xf86-intel driver and this patch (#122), and haven't had any hiccups at all.

Thanks a lot

Norbert
Comment 127 Chris Wilson 2012-12-30 10:39:00 UTC
xf86-video-intel commit 736b89504a32239a0c7dfb5961c1b8292dd744bd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Dec 30 10:32:18 2012 +0000

    uxa: Align surface allocations to even tile rows
    
    Align surface sizes to an even number of tile rows to cater for sampler
    prefetch. If we read beyond the last page we may catch the PTE in a
    state of flux and trigger a GPU hang. Also detected by enabling invalid
    PTE access checking.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=56916
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55984
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
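
The rounding described in the commit message can be illustrated with plain arithmetic. This is a sketch of the idea, not the code from the commit: it assumes that "an even number of tile rows" means rounding the surface height up to a multiple of twice the tile height. The function name and the 8-row tile height used in the example are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical helper: round a surface height (in pixel rows) up so that it
# covers an even number of tile rows.
#   $1 = surface height in rows
#   $2 = tile height in rows (assumed value supplied by the caller)
align_to_even_tile_rows() {
    height=$1
    tile_height=$2
    step=$((2 * tile_height))            # two tile rows at a time
    echo $(( (height + step - 1) / step * step ))
}
```

For example, `align_to_even_tile_rows 100 8` gives 112 rows, i.e. 14 tile rows, where a plain tile alignment would stop at 13 tile rows (104 rows) and leave the last row without a successor for the prefetch to land in.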
Comment 128 Jan Alexander Steffens (heftig) 2013-01-01 02:01:07 UTC
(In reply to comment #126)
> Ok, it seems to be very stable now. I have been running for a few days with
> the patched xf86-intel driver and this patch (#122), and haven't had any
> hiccups at all.

I can confirm this. Very stable now.

Thinkpad X220 (SNB), Linux 3.7.1 (w/ patch), intel 2.20.17 (SNA)
Comment 129 Norbert Preining 2013-01-05 02:18:50 UTC
(In reply to comment #126)
> Ok, it seems to be very stable now. I have been running for a few days with
> the patched xf86-intel driver and this patch (#122), and haven't had any
> hiccups at all.

Ok, I got one. It took a *long* time, and I (stupidly) didn't grab the error state (too tired after 24 hours of flight/travel), but it definitely happened while running an apt-get upgrade after a long time, while at the same time browsing photos from my travels.

Let's say, the situation has vastly improved, but is not perfect. I will try to get the error state as soon as I hit it again.

Norbert
Comment 130 Chris Wilson 2013-01-09 02:44:39 UTC
Created attachment 72697 [details] [review]
Longshot 1: remove g4x/g5 specific MI_FLUSH

This bit is described as disabling the reload of indirect state pointers from the context. Since we aren't using any of that, toggling this bit shouldn't do anything. However, it is g4x/g5 specific...
Comment 131 Chris Wilson 2013-01-09 02:46:24 UTC
Created attachment 72698 [details] [review]
Longshot 2: make the shrinker less aggressive towards instruction bo

A variation on the shrinker, with the theory being that it is the kernel / instruction state that is being corrupted by the rebinding.
Comment 132 Norbert Preining 2013-01-10 04:26:15 UTC
Hi everyone,

(In reply to comment #129)
> (In reply to comment #126)
> > Ok, it seems to be very stable now. I have been running for a few days with
> > the patched xf86-intel driver and this patch (#122), and haven't had any
> > hiccups at all.
> 
> Ok, I got one. It took a *long* time, and I (stupidly) didn't grab the error
> state (too tired after 24 hours of flight/travel), but it definitely
> happened while running an apt-get upgrade after a long time, while at the
> same time browsing photos from my travels.
> 
> Let's say, the situation has vastly improved, but is not perfect. I will try
> to get the error state as soon as I hit it again.

Actually, trying it again, I see the following:

3.7.0 plus patch from #122 with patched intel driver is rock solid
3.8.0-rc2 with patched intel driver (but no kernel patch) hangs (uploading the error state file soon)

I will try 3.8.0-rc3 with #122 patch plus the two from Chris 130/131 now.

Norbert
Comment 133 Norbert Preining 2013-01-10 04:27:39 UTC
Created attachment 72770 [details]
i915 error state, 3.8.0-rc2, no patches
Comment 134 Daniel Vetter 2013-01-10 17:14:28 UTC
Everyone please retest with latest drm-intel-fixes from

http://cgit.freedesktop.org/~danvet/drm-intel

I've just merged a bunch of duct-tapes for this issue. For those who can only reproduce the hangs with rc6 enabled, please also try reenabling that with i915.i915_enable_rc6=1.
Comment 135 Chris Wilson 2013-01-11 10:19:14 UTC
Created attachment 72847 [details] [review]
Hang me

So now that the workaround is upstream, we need to find a way to retrigger the bug...

This patch causes us to unbind everything after each batch - but it also causes execution to be serialised. So the timing is going to be completely different versus the IO related hangs... We might try evicting before the batch instead.
Comment 136 Chris Wilson 2013-01-12 16:20:27 UTC
*** Bug 59280 has been marked as a duplicate of this bug. ***
Comment 137 Norbert Preining 2013-01-14 02:07:49 UTC
> 3.7.0 plus patch from #122 with patched intel driver is rock solid
> 3.8.0-rc2 with patched intel driver (but no kernel patch) hangs (uploading
> the error state file soon)
> 
> I will try 3.8.0-rc3 with #122 patch plus the two from Chris 130/131 now.

3.8.0-rc3 with patches from #122, #130, #131 seems to be very stable again.

Concerning the other two requests: since testing for the *absence* of the bug always takes me a few days before I am convinced that it does not appear, which of the two, #134 or #135, should I try next, taking into account that the patches from #122, #130 and #131 do work out in some way?

Thanks

Norbert
Comment 138 Daniel Vetter 2013-01-14 17:35:45 UTC
*** Bug 56916 has been marked as a duplicate of this bug. ***
Comment 139 Daniel Vetter 2013-01-14 17:35:53 UTC
*** Bug 57122 has been marked as a duplicate of this bug. ***
Comment 140 Daniel Vetter 2013-01-14 17:36:39 UTC
*** Bug 57136 has been marked as a duplicate of this bug. ***
Comment 141 Chris Wilson 2013-01-15 13:14:09 UTC
Created attachment 73082 [details] [review]
Drop caches

I can reproduce this using the attached patch and UXA on ilk:

$ while sleep .5; do echo 15 > /sys/kernel/debug/dri/0/i915_gem_drop_caches ; done &
$ DISPLAY=:0 CAIRO_TEST_TARGET=xlib ./cairo-perf-trace -i6 cairo-traces/benchmark/firefox-fishtank.trace

Dies in mere seconds.
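
The value 15 written in the loop above reads naturally as a bitmask with all four low bits set. The sketch below decodes such a value under a loudly stated assumption: that the attached patch uses the same bit layout as the `i915_gem_drop_caches` debugfs interface later merged upstream (0x1 unbound, 0x2 bound, 0x4 retire, 0x8 active). The attachment itself is not quoted in this thread, so treat the mapping as unverified; the function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list the drop flags set in a numeric mask.
# Assumed bit layout (from the later upstream interface, not confirmed for
# the attached patch): 1 = unbound, 2 = bound, 4 = retire, 8 = active.
decode_drop_flags() {
    value=$1
    names=""
    if [ $((value & 1)) -ne 0 ]; then names="$names unbound"; fi
    if [ $((value & 2)) -ne 0 ]; then names="$names bound"; fi
    if [ $((value & 4)) -ne 0 ]; then names="$names retire"; fi
    if [ $((value & 8)) -ne 0 ]; then names="$names active"; fi
    echo "${names# }"                    # strip the leading space
}
```

Under that assumption, `decode_drop_flags 15` lists all four flags, which would make the reproducer evict everything it can every half second while the trace runs.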
Comment 142 Chris Wilson 2013-01-15 16:20:47 UTC
Created attachment 73105 [details] [review]
Invalidate the presumed_offsets along the slow relocation path
Comment 143 Chris Wilson 2013-01-16 11:28:39 UTC
commit 262b6d363fcff16359c93bd58c297f961f6e6273
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 15 16:17:54 2013 +0000

    drm/i915: Invalidate the relocation presumed_offsets along the slow path
Comment 144 Florian Mickler 2013-01-19 23:02:07 UTC
A patch referencing this bug report has been merged in Linux v3.8-rc4:

commit 93927ca52a55c23e0a6a305e7e9082e8411ac9fa
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Jan 10 18:03:00 2013 +0100

    drm/i915: Revert shrinker changes from "Track unbound pages"
Comment 145 Florian Mickler 2013-01-19 23:02:59 UTC
A patch referencing this bug report has been merged in Linux v3.8-rc4:

commit 901593f2bf221659a605bdc1dcb11376ea934163
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Dec 19 16:51:06 2012 +0000

    drm: Only evict the blocks required to create the requested hole
Comment 146 Milan Bouchet-Valat 2013-01-25 10:29:20 UTC
I can confirm I have not experienced the bug with drm-intel-next after several days of testing. What's the procedure next? Are the patches going to be backported to 3.7.x?
Comment 147 Daniel Vetter 2013-01-25 10:34:24 UTC
(In reply to comment #146)
> I can confirm I have not experienced the bug with drm-intel-next after
> several days of testing. What's the procedure next? Are the patches going to
> be backported to 3.7.x?

The band-aid is backported already afaik, the real fix should show up in the next 3.7.x point release (currently under -stable review).
Comment 148 Milan Bouchet-Valat 2013-01-25 11:15:06 UTC
Ah, thanks, great! Now I see that the two patches from Comment #144 and Comment #145 are included in 3.7.3. So I guess the real fix you're talking about is the one from Comment #143.
Comment 149 Daniel Vetter 2013-01-25 16:42:21 UTC
(In reply to comment #148)
> Ah, thanks, great! Now I see that the two patches from Comment #144 and
> Comment #145 are included in 3.7.3. So I guess the real fix you're talking
> about is the one from Comment #143.

Yep, that's right.