57136 – [GM45 regression] GPU hang during disk io

Status:	CLOSED DUPLICATE of bug 55984

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	Other All

Importance:	highest normal
Assignee:	Imre Deak
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2012-11-15 00:05 UTC by Sandro Mani
Modified:	2016-10-07 05:30 UTC
CC List:	9 users

See Also:
i915 platform:
i915 features:

Attachments
backtrace (26.01 KB, text/plain) 2012-11-15 00:05 UTC, Sandro Mani	no flags
i915_err_state (1.40 MB, text/plain) 2012-11-15 00:43 UTC, Sandro Mani	no flags
dmesg (204.78 KB, text/plain) 2012-11-18 19:45 UTC, Sandro Mani	no flags
dmesg, including boot stage (91.18 KB, text/plain) 2012-11-18 23:43 UTC, Sandro Mani	no flags
kernel config (119.68 KB, text/plain) 2012-11-19 17:20 UTC, Sandro Mani	no flags
make the shrinker less aggressive (2.18 KB, patch) 2012-12-19 13:41 UTC, Daniel Vetter	no flags
tar.gz containing output of 'dmesg', Xorg.0.log, /var/gdm/:0*.log, empty i915_error_state (97.43 KB, application/x-tar) 2012-12-23 00:47 UTC, Tom London	no flags
Another tar.gz, this time with i915_error_state, dmesg, Xorg.0.log, etc. (304.61 KB, application/x-tar) 2012-12-23 16:49 UTC, Tom London	no flags
tar.gz containing 'dmesg', i915_error_state, Xorg.0.log, gdm/0:*.log (300.02 KB, application/x-gzip) 2012-12-27 00:55 UTC, Tom London	no flags
tar.gz containing dmesg, i915_error_state, Xorg.0.log, /var/log/gdm/*.log (296.76 KB, application/x-gzip) 2013-01-08 14:51 UTC, Tom London	no flags
Longshot 1: remove g4x/g5 specific MI_FLUSH (848 bytes, patch) 2013-01-09 02:41 UTC, Chris Wilson	no flags
Longshot 2: make the shrinker less aggressive towards instruction bo (848 bytes, patch) 2013-01-09 02:41 UTC, Chris Wilson	no flags
Longshot 2: make the shrinker less aggressive towards instruction bo (2.18 KB, patch) 2013-01-09 14:40 UTC, Chris Wilson	no flags
tar.gz included 'dmesg', i915_error_state, etc. (298.76 KB, application/x-gzip) 2013-01-10 19:51 UTC, Tom London	no flags

Description Sandro Mani 2012-11-15 00:05:00 UTC

Created attachment 70090 [details]
backtrace

This is probably an exact duplicate of #51376, although that bug is claimed to be fixed.

Hardware is a GM45 (GMA4500)
OS is Fedora Rawhide, with these relevant packages:
mesa-dri-drivers-9.0-5.fc19.x86_64
kernel-3.7.0-0.rc5.git1.3.fc19.x86_64

Reproducible: often
I've seen it three times, always in a similar scenario: I/O pressure (a busy hard drive, e.g. from a yum update) while watching HTML5 video content (YouTube).

Backtrace attached.
i915_error_state attached.

Dmesg shows a gpu lockup:
[39911.544036] [drm] GMBUS [i915 gmbus dpb] timed out, falling back to bit banging on pin 5
[41612.416035] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[41612.416041] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[41620.256016] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[41620.308044] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00003000 tail 00000000 start 00003000
[41621.828021] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[41621.828359] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[41621.828363] [drm:i915_reset] *ERROR* Failed to reset chip.
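
The same three-step pattern (hangcheck, failed ring re-initialization, "declaring wedged") recurs in every log in this report, so it can be picked out mechanically. A minimal Python sketch, assuming dmesg-style lines in the format shown above; `drm_errors` and `summarize` are hypothetical helper names, not part of any i915 tooling:

```python
import re

# Matches lines such as:
# [41612.416035] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
# search() is used so a syslog prefix in front of the timestamp is tolerated.
LINE_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+\[drm:(\w+)\]\s+\*ERROR\*\s+(.*)")

def drm_errors(dmesg_text):
    """Return (timestamp, function, message) for every [drm:func] *ERROR* line."""
    events = []
    for line in dmesg_text.splitlines():
        m = LINE_RE.search(line)
        if m:
            events.append((float(m.group(1)), m.group(2), m.group(3).strip()))
    return events

def summarize(events):
    """Count hangcheck events and report whether the GPU was declared wedged."""
    hangs = [t for t, fn, _ in events if fn == "i915_hangcheck_hung"]
    wedged = any("declaring wedged" in msg for _, _, msg in events)
    return {
        "hangcheck_count": len(hangs),
        "first_hang": hangs[0] if hangs else None,
        "wedged": wedged,
    }

# Lines copied from the description above.
SAMPLE = """\
[39911.544036] [drm] GMBUS [i915 gmbus dpb] timed out, falling back to bit banging on pin 5
[41612.416035] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[41612.416041] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[41620.256016] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[41620.308044] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00003000 tail 00000000 start 00003000
[41621.828021] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[41621.828359] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[41621.828363] [drm:i915_reset] *ERROR* Failed to reset chip.
"""

if __name__ == "__main__":
    print(summarize(drm_errors(SAMPLE)))
```

Lines without a `[drm:function]` tag, such as the GMBUS timeout, are deliberately ignored; only the error path is summarized.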

Comment 1 Sandro Mani 2012-11-15 00:43:41 UTC

Created attachment 70091 [details]
i915_err_state

Comment 2 Eric Anholt 2012-11-18 18:35:02 UTC

Reassigning to the kernel team, since this is probably an issue in the memory-pressure handling. I've also seen it before; my way to reproduce it within a few minutes was to rsync / (I think to the local system, not from it) while doing basic web browsing that did not involve Mesa.

Comment 3 Daniel Vetter 2012-11-18 18:57:48 UTC

Can you please attach dmesg, too?

Comment 4 Daniel Vetter 2012-11-18 19:16:22 UTC

A few more questions:
- Are you using a 3d compositor (gnome-shell, ...)?
- Are older kernels (like 3.6) stable?

Comment 5 Sandro Mani 2012-11-18 19:45:26 UTC

Created attachment 70234 [details]
dmesg

Attached is a dmesg from a situation when the system ran out of memory linking some absurdly huge library, which resulted in a gpu hang.

I am using kwin with compositing. With 3.6, I didn't notice the problem. Also with 3.6+drm-next (which I compiled for #55112) I never noticed the problem.

Comment 6 Daniel Vetter 2012-11-18 19:50:13 UTC

It's actually the boot-message from dmesg I'm interested in, specifically the e820 map and zone layout. Can you please attach a fresh one?

Comment 7 Sandro Mani 2012-11-18 23:43:54 UTC

Created attachment 70239 [details]
dmesg, including boot stage

Here you go

Comment 8 Daniel Vetter 2012-11-19 16:29:50 UTC

Ok, yet another new theory ... please attach your kernel .config, thanks.

Comment 9 Sandro Mani 2012-11-19 17:20:23 UTC

Created attachment 70270 [details]
kernel config

Comment 10 Daniel Vetter 2012-12-18 10:56:49 UTC

Please try out the patch at

https://patchwork.kernel.org/patch/1885411/

It has a decent chance of reducing gtt thrashing, which might be good enough to duct-tape over the hangs again. Or it may change the pattern so the hang can be reproduced much more quickly. In any case, it should be interesting ...

Comment 11 Daniel Vetter 2012-12-19 13:41:30 UTC

Created attachment 71807 [details] [review]
make the shrinker less aggressive

Duct-tape solution if it is one, but imo very much worth a try.

Comment 12 Sandro Mani 2012-12-19 15:18:54 UTC

I've now finished building a 3.7.0 kernel with your latest patch and will do some stress tests today or tomorrow - thanks!

Comment 13 Tom London 2012-12-23 00:47:15 UTC

Created attachment 72011 [details]
tar.gz containing output of 'dmesg', Xorg.0.log, /var/gdm/:0*.log, empty i915_error_state

I applied the patch suggested above and built a Fedora kernel, kernel-3.7.0-6.local.fc19.x86_64.

I then booted that kernel and drove excessive I/O load on my Thinkpad X200: I ran 'digikam', qemu-kvm with a Win7 image configured to use 2 cores, and a 'cat BIGFILES >/dev/null'.

While the system didn't crash until I got all the above running, it did hang/crash.

After this crash I could not recover /system/kernel/debug/dri/0/i915_error_state: I got a 'page allocation failure' when I attempted to copy it.

I've been BZ'ing this on the fedora bz for a while here: https://bugzilla.redhat.com/show_bug.cgi?id=877461

That ticket has numerous more such failures/logs, including a few with non-zero i915_error_state files.

I believe the patch was built in this kernel:

+ '[' '!' -f /home/tbl/rpmbuild/SOURCES/make-the-shrinker-less-aggressive.patch ']'
Patch33333: make-the-shrinker-less-aggressive.patch
+ case "$patch" in
+ patch -p1 -F1 -s
+ chmod +x scripts/checkpatch.pl


Here is what I see in dmesg:

[ 1103.968037] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1103.968330] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 1105.845804] traps: gnome-shell[1259] trap int3 ip:39c9e4f597 sp:7fff189222d0 error:0
[ 1110.016026] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1110.070657] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00003000 tail 00000000 start 00003000
[ 1111.608050] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1111.609856] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[ 1111.609859] [drm:i915_reset] *ERROR* Failed to reset chip.
[ 1114.433338] gnome-shell (1259) used greatest stack depth: 1392 bytes left
[ 1250.731117] gnome-shell[2381]: segfault at 230 ip 00007fcabc3fd89f sp 00007fff0d9909d0 error 4 in i965_dri.so[7fcabc3ab000+b3000]
[ 1305.904089] gnome-shell[2446]: segfault at 230 ip 00007ffe0688089f sp 00007fff2883df50 error 4 in i965_dri.so[7ffe0682e000+b3000]
[ 1332.132006] [sched_delayed] sched: RT throttling activated
[ 1375.751786] gnome-shell[2500]: segfault at 230 ip 00007fa9663bb89f sp 00007fff5ad6a660 error 4 in i965_dri.so[7fa966369000+b3000]
[ 1715.826609] cat: page allocation failure: order:9, mode:0x40d0
[ 1715.828604] Pid: 2789, comm: cat Not tainted 3.7.0-6.local.fc19.x86_64 #1
[ 1715.830463] Call Trace:
[ 1715.832239]  [<ffffffff81167469>] warn_alloc_failed+0xe9/0x150
[ 1715.834110]  [<ffffffff8116a090>] ? page_alloc_cpu_notify+0x50/0x50
[ 1715.835995]  [<ffffffff810d8b6d>] ? trace_hardirqs_on+0xd/0x10
[ 1715.837676]  [<ffffffff8116bc25>] __alloc_pages_nodemask+0x8b5/0xb40
[ 1715.839345]  [<ffffffff811ad460>] alloc_pages_current+0xb0/0x120
[ 1715.840971]  [<ffffffff8116991e>] ? __free_pages_ok.part.54+0x9e/0xe0
[ 1715.842522]  [<ffffffff8116632a>] __get_free_pages+0x2a/0x80
[ 1715.844143]  [<ffffffff811b9c89>] kmalloc_order_trace+0x39/0x190
[ 1715.845784]  [<ffffffff811ba07d>] __kmalloc+0x29d/0x2d0
[ 1715.847337]  [<ffffffff811f8fcf>] seq_read+0x11f/0x3e0
[ 1715.848948]  [<ffffffff811d320c>] vfs_read+0xac/0x180
[ 1715.850413]  [<ffffffff811d3335>] sys_read+0x55/0xa0
[ 1715.851846]  [<ffffffff816fbd19>] system_call_fastpath+0x16/0x1b
[ 1715.853273] Mem-Info:
[ 1715.854697] Node 0 DMA per-cpu:
[ 1715.856193] CPU    0: hi:    0, btch:   1 usd:   0
[ 1715.857537] CPU    1: hi:    0, btch:   1 usd:   0
[ 1715.858823] Node 0 DMA32 per-cpu:
[ 1715.860145] CPU    0: hi:  186, btch:  31 usd:   0
[ 1715.861417] CPU    1: hi:  186, btch:  31 usd:   0
[ 1715.862668] Node 0 Normal per-cpu:
[ 1715.864038] CPU    0: hi:  186, btch:  31 usd:  32
[ 1715.865257] CPU    1: hi:  186, btch:  31 usd:   0
[ 1715.866418] active_anon:366511 inactive_anon:174404 isolated_anon:0
 active_file:60034 inactive_file:192590 isolated_file:0
 unevictable:30 dirty:25 writeback:0 unstable:0
 free:39181 slab_reclaimable:21162 slab_unreclaimable:95728
 mapped:29886 shmem:23042 pagetables:10701 bounce:0
 free_cma:0
[ 1715.872852] Node 0 DMA free:15848kB min:264kB low:328kB high:396kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:40kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15648kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:16kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 1715.876063] lowmem_reserve[]: 0 2947 3892 3892
[ 1715.877257] Node 0 DMA32 free:111548kB min:50976kB low:63720kB high:76464kB active_anon:1288852kB inactive_anon:502196kB active_file:207752kB inactive_file:687716kB unevictable:32kB isolated(anon):0kB isolated(file):0kB present:3018404kB mlocked:32kB dirty:36kB writeback:0kB mapped:94252kB shmem:50888kB slab_reclaimable:39640kB slab_unreclaimable:128856kB kernel_stack:840kB pagetables:20200kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 1715.882062] lowmem_reserve[]: 0 0 945 945
[ 1715.883246] Node 0 Normal free:27972kB min:16340kB low:20424kB high:24508kB active_anon:177192kB inactive_anon:195420kB active_file:32384kB inactive_file:83896kB unevictable:88kB isolated(anon):0kB isolated(file):0kB present:967680kB mlocked:88kB dirty:64kB writeback:0kB mapped:25292kB shmem:41280kB slab_reclaimable:45008kB slab_unreclaimable:254040kB kernel_stack:1960kB pagetables:22604kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 1715.888714] lowmem_reserve[]: 0 0 0 0
[ 1715.890149] Node 0 DMA: 2*4kB 2*8kB 1*16kB 0*32kB 3*64kB 2*128kB 2*256kB 1*512kB 2*1024kB 2*2048kB 2*4096kB = 15848kB
[ 1715.891696] Node 0 DMA32: 1573*4kB 1401*8kB 906*16kB 510*32kB 424*64kB 202*128kB 26*256kB 3*512kB 0*1024kB 1*2048kB 0*4096kB = 111548kB
[ 1715.893244] Node 0 Normal: 1393*4kB 669*8kB 203*16kB 144*32kB 58*64kB 25*128kB 4*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 27228kB
[ 1715.894843] 282546 total pagecache pages
[ 1715.896341] 6307 pages in swap cache
[ 1715.897947] Swap cache stats: add 54887, delete 48580, find 8649/10442
[ 1715.899457] Free swap  = 6015380kB
[ 1715.900944] Total swap = 6127612kB
[ 1715.919139] 1032176 pages RAM
[ 1715.920639] 52602 pages reserved
[ 1715.922114] 714600 pages shared
[ 1715.923569] 896850 pages non-shared

Let me know if I can provide more or test more....
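
The "order:9" in the allocation failure above means a request for 2^9 physically contiguous pages. With the 4 kB pages implied by the "N*4kB" entries, that is 2048 kB. The per-zone free lists in the same dump can be checked against that size. A minimal Python sketch; the helper names are hypothetical, and it does not model zone watermarks or `lowmem_reserve`, so it only shows how many large-enough blocks were nominally free, not why the allocation was refused:

```python
import re

PAGE_KB = 4  # page size assumed from the "N*4kB" entries in the dump

def order_to_kb(order):
    """Size in kB of one contiguous block of the given buddy order."""
    return (1 << order) * PAGE_KB

def parse_buddy_line(line):
    """Parse 'Node 0 DMA32: 1573*4kB ... = 111548kB' into
    (zone, {block_kb: count}, reported_total_kb)."""
    m = re.search(r"Node \d+ (\w+): (.*) = (\d+)kB", line)
    zone, body, total = m.group(1), m.group(2), int(m.group(3))
    blocks = {int(size): int(count)
              for count, size in re.findall(r"(\d+)\*(\d+)kB", body)}
    return zone, blocks, total

def blocks_at_least(blocks, need_kb):
    """Number of free blocks whose size is at least need_kb."""
    return sum(count for size, count in blocks.items() if size >= need_kb)

# Free-list lines copied from the dmesg output above.
LINES = [
    "Node 0 DMA: 2*4kB 2*8kB 1*16kB 0*32kB 3*64kB 2*128kB 2*256kB 1*512kB 2*1024kB 2*2048kB 2*4096kB = 15848kB",
    "Node 0 DMA32: 1573*4kB 1401*8kB 906*16kB 510*32kB 424*64kB 202*128kB 26*256kB 3*512kB 0*1024kB 1*2048kB 0*4096kB = 111548kB",
    "Node 0 Normal: 1393*4kB 669*8kB 203*16kB 144*32kB 58*64kB 25*128kB 4*256kB 1*512kB 0*1024kB 0*2048kB 0*4096kB = 27228kB",
]

if __name__ == "__main__":
    need = order_to_kb(9)
    for line in LINES:
        zone, blocks, total = parse_buddy_line(line)
        computed = sum(size * count for size, count in blocks.items())
        print(zone, "reported", total, "computed", computed,
              "blocks >=", need, "kB:", blocks_at_least(blocks, need))
```

The Normal zone lists no free block of 2048 kB or larger, which illustrates how fragmented memory was when `cat` tried to read the error state.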

Comment 14 Chris Wilson 2012-12-23 12:01:55 UTC

(In reply to comment #13)
> Created attachment 72011 [details]
> tar.gz containing output of 'dmesg', Xorg.0.log, /var/gdm/:0*.log, empty
> i915_error_state
> 
> I applied the patch suggested above and built a Fedora kernel,
> kernel-3.7.0-6.local.fc19.x86_64.
> 
> I then booted that kernel and drove excessive I/O load on my Thinkpad X200:
> I ran 'digikam', qemu-kvm with a Win7 image configured to use 2 cores, and a
> 'cat BIGFILES >/dev/null'.
> 
> While the system didn't crash until I got all the above running, it did
> hang/crash.
> 
> This crash, I could not recover /system/kernel/debug/dri/0/i915_error_state:
> I got a 'page allocation failure' when I attempted to copy it.

Drat. Without the error state we can't verify that this hang is the same one we are hunting, but given the scenario it should be. So (as expected) we can take this as evidence that the patch is not a sufficient workaround.

Comment 15 Tom London 2012-12-23 16:49:09 UTC

Created attachment 72036 [details]
Another tar.gz, this time with i915_error_state, dmesg, Xorg.0.log, etc.

Looks like I can easily recreate this hang by booting, starting a 'cat About-30G-files >/dev/null', and then starting 'digikam'.

This time, when it hung, I 'ctrl-alt-F2' to a terminal, logged in as root, killed off the offending 'cat' process, did a 'sync', and ran my script.

You should find the i915_error_state file in the tarball.

Let me know if you need another/more ...
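
The collection script mentioned above is not included in this report. A rough Python sketch of what such a script could look like follows; the default paths are assumptions (the debugfs mount point and log locations vary between systems) and `collect` is a hypothetical helper. It reads each file into memory rather than relying on the size reported by the filesystem, because virtual files such as those under debugfs often report a size of 0 even when they have content:

```python
import io
import os
import tarfile

# Hypothetical default list; adjust for the local debugfs mount point and
# log locations. dmesg output is a command result, not a file, and would
# have to be saved to a file first.
DEFAULT_PATHS = [
    "/sys/kernel/debug/dri/0/i915_error_state",
    "/var/log/Xorg.0.log",
]

def collect(paths, out_path):
    """Write every readable, non-empty file in `paths` into a gzip tarball.

    Returns the list of paths that were skipped (missing, unreadable or empty).
    """
    skipped = []
    with tarfile.open(out_path, "w:gz") as tar:
        for path in paths:
            try:
                with open(path, "rb") as f:
                    data = f.read()
            except OSError:
                skipped.append(path)
                continue
            if not data:
                # An empty i915_error_state is of no use to the developers.
                skipped.append(path)
                continue
            info = tarfile.TarInfo(name=os.path.basename(path))
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return skipped

# Example (run as root, after the hang):
#   skipped = collect(DEFAULT_PATHS, "gpu-hang.tar.gz")
```

Reporting the skipped files makes it obvious when the error state came back empty, as happened with the first tarball in this report.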

Comment 16 Tom London 2012-12-23 16:50:38 UTC

Forgot to paste in the dmesg spew:

[  137.364357] DMA-API: debugging out of memory - disabling
[  242.788042] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  242.788337] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[  248.824064] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  248.877066] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00003000 tail 00000000 start 00003000
[  250.384039] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  250.385463] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[  250.385467] [drm:i915_reset] *ERROR* Failed to reset chip.
[  263.697098] gnome-shell[2020]: segfault at 230 ip 00007fc6035c689f sp 00007fffcbeb8d30 error 4 in i965_dri.so[7fc603574000+b3000]
[  304.669091] kworker/u:0 (6) used greatest stack depth: 2176 bytes left

Comment 17 Chris Wilson 2012-12-23 16:56:21 UTC

Thanks, the same hang as before so we can be certain that that particular workaround is not sufficient.

Comment 18 Tom London 2012-12-24 01:07:23 UTC

Looks like I can reproduce this pretty easily: I have a directory with my KVM guest images plus some CD ISO images.

Running 'cat all those files >/dev/null' hangs/crashes my system (I had only rhythmbox, firefox + a couple of gnome-terminal windows open).

This hang occurred with a newly built/booted Fedora kernel-3.7.1-1.local.fc19.x86_64 (above patch included).

Here is the dmesg spew. Let me know if the i915_error_state would be helpful.

[ 7438.644043] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 7438.644056] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 7444.672037] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 7444.723034] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00003000 tail 00000000 start 00003000
[ 7446.228045] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 7446.228178] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[ 7446.228181] [drm:i915_reset] *ERROR* Failed to reset chip.
[ 7473.694112] gnome-shell[3157]: segfault at 230 ip 00007fc2f58c589f sp 00007fffd9fe8a60 error 4 in i965_dri.so[7fc2f5873000+b3000]

Comment 19 Chris Wilson 2012-12-26 21:26:56 UTC

Tom, the patches I have other people testing are:

https://bugs.freedesktop.org/attachment.cgi?id=72022
https://bugs.freedesktop.org/attachment.cgi?id=71933

Can you try both of those (kernel + ddx)?

Comment 20 Tom London 2012-12-26 22:23:18 UTC

Just to be clear, you want me to remove the previous patch before applying these, right?

Comment 21 Chris Wilson 2012-12-26 23:10:50 UTC

(In reply to comment #20)
> Just to be clear, you want me to remove the previous patch before applying
> these, right?

Yes. The idea is to find the minimal set of patches required, and hope it is an obvious single line change...

Comment 22 Tom London 2012-12-27 00:55:06 UTC

Created attachment 72157 [details]
tar.gz containing 'dmesg', i915_error_state, Xorg.0.log, gdm/0:*.log

OK. I've built:

kernel-3.7.1-1.local2.fc19.x86_64
xorg-x11-drv-intel-2.20.16-1.local.fc19.x86_64

with the above 2 patches, and rebooted:

[tbl@tlondon ~]$ uname -a
Linux tlondon.localhost.org 3.7.1-1.local2.fc19.x86_64 #1 SMP Wed Dec 26 15:21:18 PST 2012 x86_64 x86_64 x86_64 GNU/Linux
[tbl@tlondon ~]$ rpm -q xorg-x11-drv-intel
xorg-x11-drv-intel-2.20.16-1.local.fc19.x86_64
[tbl@tlondon ~]$ 

I started my "read lots of blocks from the disk" command:
     cat *.ISO *.img >/dev/null&

and ran a 'vmstat -10' in the terminal.

Within 2 minutes (of quite high disk traffic), I got what appears to be the usual hang/crash:

[  299.800029] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  299.800036] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[  311.840049] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  311.892025] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00003000 tail 00000000 start 00003000
[  313.396044] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  313.396164] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[  313.396167] [drm:i915_reset] *ERROR* Failed to reset chip.
[  351.595568] gnome-shell[1978]: segfault at 230 ip 00007fa61378989f sp 00007fff1d0a4730 error 4 in i965_dri.so[7fa613737000+b3000]

I'm pretty sure I'm applying the kernel patch properly:

+ case "$patch" in
+ patch -p1 -F1 -s
+ ApplyPatch 8139cp-re-enable-interrupts-after-tx-timeout.patch
+ local patch=8139cp-re-enable-interrupts-after-tx-timeout.patch
+ shift
+ '[' '!' -f /home/tbl/rpmbuild/SOURCES/8139cp-re-enable-interrupts-after-tx-timeout.patch ']'
Patch21233: 8139cp-re-enable-interrupts-after-tx-timeout.patch
+ case "$patch" in
+ patch -p1 -F1 -s
+ ApplyPatch only-evict-block-required-for-requested-hole.patch
+ local patch=only-evict-block-required-for-requested-hole.patch
+ shift
+ '[' '!' -f /home/tbl/rpmbuild/SOURCES/only-evict-block-required-for-requested-hole.patch ']'
Patch33334: only-evict-block-required-for-requested-hole.patch
+ case "$patch" in
+ patch -p1 -F1 -s
+ chmod +x scripts/checkpatch.pl
+ touch .scmversion

I attach a tar.gz with the usual files, including a legit looking i915_error_state.

Comment 23 Tom London 2013-01-05 19:01:10 UTC

Updated to xorg-x11-drv-intel-2.20.17-1.fc19.x86_64, reran my "disk load" test ("cat bigfiles >/dev/null"), and waited.

Within about 2 minutes gdm/Xorg hard crashed, the screen was black, and the system was unresponsive to the usual keyboard entries (i.e., ctrl-alt-F2, ctrl-alt-bksp, ctrl-alt-delete).

I did not get the "gdm Ooops something has gone wrong" screen.

I had to hard power reset the system.

On rebooting, I see this in /var/log/messages.


Jan  5 10:27:59 tlondon kernel: [ 2017.404040] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Jan  5 10:27:59 tlondon kernel: [ 2017.404047] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Jan  5 10:28:05 tlondon kernel: [ 2023.424023] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Jan  5 10:28:05 tlondon kernel: [ 2023.475044] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00003000 tail 00000000 start 00003000
Jan  5 10:28:06 tlondon kernel: [ 2025.140021] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Jan  5 10:28:06 tlondon kernel: [ 2025.140106] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
Jan  5 10:28:06 tlondon kernel: [ 2025.140108] [drm:i915_reset] *ERROR* Failed to reset chip.
Jan  5 10:28:07 tlondon kernel: [ 2025.214077] ------------[ cut here ]------------
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3476!
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] invalid opcode: 0000 [#1] SMP 
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] Modules linked in: fuse(F) ip6table_filter(F) ip6_tables(F) ebtable_nat(F) ebtables(F) ipt_MASQUERADE(F) iptable_nat(F) nf_nat_ipv4(F) nf_nat(F) nf_conntrack_ipv4(F) nf_defrag_ipv4(F) xt_conntrack(F) nf_conntrack(F) xt_CHECKSUM(F) iptable_mangle(F) bridge(F) stp(F) llc(F) lockd(F) sunrpc(F) snd_usb_audio(F) snd_hda_codec_conexant(F) snd_usbmidi_lib(F) arc4(F) iwldvm(F) snd_hda_intel(F) snd_hda_codec(F) uvcvideo(F) snd_hwdep(F) snd_rawmidi(F) snd_seq(F) snd_seq_device(F) mac80211(F) videobuf2_vmalloc(F) videobuf2_memops(F) videobuf2_core(F) videodev(F) snd_pcm(F) thinkpad_acpi(F) iwlwifi(F) snd_page_alloc(F) media(F) snd_timer(F) snd(F) cfg80211(F) soundcore(F) e1000e(F) btusb(F) iTCO_wdt(F) bluetooth(F) coretemp(F) iTCO_vendor_support(F) mei(F) tpm_tis(F) tpm(F) lpc_ich(F) rfkill(F) mfd_core(F) i2c_i801(F) tpm_bios(F) microcode(F) vhost_net(F) tun(F) macvtap(F) macvlan(F) kvm_intel(F) kvm(F) binfmt_misc(F) uinput(F) i915(F) i2c_algo_bit(F) drm_kms_helper(F) drm(F) i2c_core(F) wmi(F) video(F)
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] CPU 0 
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] Pid: 660, comm: Xorg Tainted: GF            3.7.1-1.local2.fc19.x86_64 #1 LENOVO 74585FU/74585FU
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] RIP: 0010:[<ffffffffa009c847>]  [<ffffffffa009c847>] i915_gem_object_unpin+0x47/0x50 [i915]
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] RSP: 0018:ffff880134be7938  EFLAGS: 00010246
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] RAX: ffff880130a78000 RBX: ffff880130da3800 RCX: 0000000000000000
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] RDX: 0000000000000002 RSI: 0000000000070008 RDI: ffff8801262db400
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] RBP: ffff880134be7938 R08: 0000000000000030 R09: 0000000000000006
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] R10: 0000000000000000 R11: 0000000000000001 R12: ffff880130da0800
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] R13: ffff880130da0820 R14: 0000000000000000 R15: ffff880130da0800
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] FS:  00007fc5f1d5f940(0000) GS:ffff88013bc00000(0000) knlGS:0000000000000000
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] CR2: 00000000008054bc CR3: 0000000130822000 CR4: 00000000000007f0
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] Process Xorg (pid: 660, threadinfo ffff880134be6000, task ffff880130964560)
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] Stack:
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  ffff880134be7948 ffffffffa00adf5e ffff880134be7978 ffffffffa00b17e6
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  ffff8801338497d8 ffff880130da3800 0000000000000001 ffff880130da0c50
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  ffff880134be7c08 ffffffffa00b43d2 ffff880100000001 000000008121ac18
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] Call Trace:
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffffa00adf5e>] intel_unpin_fb_obj+0x3e/0x40 [i915]
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffffa00b17e6>] intel_crtc_disable+0x96/0x130 [i915]
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffffa00b43d2>] intel_set_mode+0x262/0xa50 [i915]
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff8121d26c>] ? ext4_dirty_inode+0x3c/0x60
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff8125b182>] ? jbd2_journal_stop+0x1b2/0x2a0
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff81237dc6>] ? __ext4_journal_stop+0x76/0xa0
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff8121badd>] ? ext4_da_write_end+0x9d/0x350
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff812f1a31>] ? vsnprintf+0x461/0x600
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff812f1c74>] ? snprintf+0x34/0x40
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffffa00b4d11>] ? intel_crtc_set_config+0x151/0x970 [i915]
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffffa00b52d6>] intel_crtc_set_config+0x716/0x970 [i915]
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff81633af6>] ? __schedule+0x3c6/0x7a0
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffffa0037286>] drm_framebuffer_remove+0xc6/0x150 [drm]
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffffa003ac75>] drm_mode_rmfb+0xd5/0xe0 [drm]
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffffa002a4a3>] drm_ioctl+0x4d3/0x580 [drm]
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff811d3402>] ? send_to_group+0x182/0x250
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffffa003aba0>] ? drm_mode_addfb2+0x6d0/0x6d0 [drm]
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff811d372f>] ? fsnotify+0x25f/0x340
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff811a6649>] do_vfs_ioctl+0x99/0x580
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff8128b94a>] ? inode_has_perm.isra.31.constprop.61+0x2a/0x30
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff8128cd17>] ? file_has_perm+0x97/0xb0
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff811a6bc1>] sys_ioctl+0x91/0xb0
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff810dc8cc>] ? __audit_syscall_exit+0x3ec/0x450
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  [<ffffffff8163d9d9>] system_call_fastpath+0x16/0x1b
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] Code: 00 74 2a 89 d0 83 e2 0f c0 e8 04 83 e8 01 83 e0 0f 89 c1 c1 e1 04 09 ca 84 c0 88 97 e9 00 00 00 75 07 80 a7 ea 00 00 00 fb 5d c3 <0f> 0b 0f 0b 0f 1f 44 00 00 66 66 66 66 90 55 48 89 e5 41 57 41 
Jan  5 10:28:07 tlondon kernel: [ 2025.215017] RIP  [<ffffffffa009c847>] i915_gem_object_unpin+0x47/0x50 [i915]
Jan  5 10:28:07 tlondon kernel: [ 2025.215017]  RSP <ffff880134be7938>
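
In an x86 call trace like the one above, entries marked with "?" are addresses that were found on the stack but that the unwinder could not confirm as part of the call chain, so the likely path to the BUG is the sequence of unmarked entries. A minimal Python sketch that separates the two; `parse_frames` is a hypothetical helper, and it is meant to be fed only the lines that follow "Call Trace:":

```python
import re

# Matches e.g. "[<ffffffffa00adf5e>] intel_unpin_fb_obj+0x3e/0x40 [i915]"
# and       "[<ffffffff8121d26c>] ? ext4_dirty_inode+0x3c/0x60"
FRAME_RE = re.compile(
    r"\[<([0-9a-f]+)>\]\s+(\?\s+)?([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(\w+)\])?"
)

def parse_frames(text):
    """Return one dict per stack entry: symbol, module (None for the core
    kernel) and whether the entry is unmarked (no leading '?')."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m:
            frames.append({
                "symbol": m.group(3),
                "module": m.group(4),
                "reliable": m.group(2) is None,
            })
    return frames

# A subset of the trace above, with the syslog prefix removed.
TRACE = """\
 [<ffffffffa00adf5e>] intel_unpin_fb_obj+0x3e/0x40 [i915]
 [<ffffffffa00b17e6>] intel_crtc_disable+0x96/0x130 [i915]
 [<ffffffffa00b43d2>] intel_set_mode+0x262/0xa50 [i915]
 [<ffffffff8121d26c>] ? ext4_dirty_inode+0x3c/0x60
 [<ffffffffa00b4d11>] ? intel_crtc_set_config+0x151/0x970 [i915]
 [<ffffffffa00b52d6>] intel_crtc_set_config+0x716/0x970 [i915]
 [<ffffffffa0037286>] drm_framebuffer_remove+0xc6/0x150 [drm]
 [<ffffffffa003ac75>] drm_mode_rmfb+0xd5/0xe0 [drm]
 [<ffffffffa002a4a3>] drm_ioctl+0x4d3/0x580 [drm]
 [<ffffffff811a6649>] do_vfs_ioctl+0x99/0x580
 [<ffffffff8128b94a>] ? inode_has_perm.isra.31.constprop.61+0x2a/0x30
 [<ffffffff811a6bc1>] sys_ioctl+0x91/0xb0
"""

if __name__ == "__main__":
    for frame in parse_frames(TRACE):
        if frame["reliable"]:
            print(frame["symbol"], frame["module"] or "")
```

Read this way, the BUG in `i915_gem_object_unpin` is reached from a framebuffer removal ioctl (`drm_mode_rmfb`) that disables the CRTC and unpins its framebuffer object.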

Comment 24 Tom London 2013-01-05 21:28:00 UTC

More:

This just popped up in dmesg:


[10213.840108] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[10213.841102] i915: render error detected, EIR: 0x00000010
[10213.841102] i915:   IPEIR: 0x00000000
[10213.841102] i915:   IPEHR: 0x69040000
[10213.841102] i915:   INSTDONE_0: 0xffffffff
[10213.841102] i915:   INSTDONE_1: 0xbfbbffff
[10213.841102] i915:   INSTDONE_2: 0x00000000
[10213.841102] i915:   INSTDONE_3: 0x00000000
[10213.841102] i915:   INSTPS: 0x8001e025
[10213.841102] i915:   ACTHD: 0x055b608c
[10213.841102] i915: page table error
[10213.841102] i915:   PGTBL_ER: 0x00000001
[10213.841102] [drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking


i915_error_state was empty:

[root@tlondon dri]# ls -l i915_error_state 
-rw-r--r--. 1 root root 0 Jan  5 13:25 i915_error_state
[root@tlondon dri]#

Comment 25 Tom London 2013-01-06 20:03:22 UTC

I seem to be able to reproduce at will.

Any more testing/reporting/building I can do to help?

Comment 26 Tom London 2013-01-08 14:51:17 UTC

Created attachment 72677 [details]
tar.gz containing dmesg, i915_error_state, Xorg.0.log, /var/log/gdm/*.log

Hang/crash continues with kernel-3.8.0-0.rc2.git2.2.fc19.x86_64 and xorg-x11-drv-intel-2.20.17-1.fc19.x86_64.


[  368.708039] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  368.708047] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[  376.708382] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  376.759026] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00003000 tail 00000000 start 00003000
[  378.704039] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  378.704541] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[  378.704543] [drm:i915_reset] *ERROR* Failed to reset chip.
[  403.384220] gnome-shell[1981]: segfault at 230 ip 00007f40b0fb989f sp 00007fff1db58130 error 4 in i965_dri.so[7f40b0f67000+b3000]

Here are the first 20 lines of i915_error_state:

Time: 1357655991 s 435476 us
PCI ID: 0x2a42
EIR: 0x00000000
IER: 0x02028c53
PGTBL_ER: 0x00000000
CCID: 0x00000000
  fence[0] = 00000000
  fence[1] = 00000000
  fence[2] = 00000000
  fence[3] = 591e0000511f0dd
  fence[4] = 00000000
  fence[5] = 00000000
  fence[6] = 00000000
  fence[7] = 00000000
  fence[8] = 00000000
  fence[9] = 00000000
  fence[10] = 00000000
  fence[11] = 00000000
  fence[12] = 00000000
  fence[13] = 00000000


I attach tar.gz containing dmesg output, /var/log/gdm/*.log, Xorg.0.log and i915_error_state.

More I can do?
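
The header of an i915_error_state file, as excerpted above, is a list of "NAME: 0xVALUE" registers followed by fence registers, which makes it easy to compare several captures. A minimal Python sketch, assuming only the layout visible in that excerpt and that the fence values are hexadecimal; `parse_error_state_header` is a hypothetical helper:

```python
import re

REG_RE = re.compile(r"([\w ]+): 0x([0-9a-fA-F]+)")
FENCE_RE = re.compile(r"fence\[(\d+)\] = ([0-9a-fA-F]+)")

def parse_error_state_header(text):
    """Return (registers, fences) as dicts of integers.

    Lines that fit neither pattern (for example the 'Time:' line) are ignored.
    """
    regs, fences = {}, {}
    for raw in text.splitlines():
        line = raw.strip()
        m = FENCE_RE.fullmatch(line)
        if m:
            fences[int(m.group(1))] = int(m.group(2), 16)
            continue
        m = REG_RE.fullmatch(line)
        if m:
            regs[m.group(1)] = int(m.group(2), 16)
    return regs, fences

# First lines of the capture quoted above (fence list shortened).
HEADER = """\
Time: 1357655991 s 435476 us
PCI ID: 0x2a42
EIR: 0x00000000
IER: 0x02028c53
PGTBL_ER: 0x00000000
CCID: 0x00000000
  fence[0] = 00000000
  fence[1] = 00000000
  fence[2] = 00000000
  fence[3] = 591e0000511f0dd
  fence[4] = 00000000
"""

if __name__ == "__main__":
    regs, fences = parse_error_state_header(HEADER)
    in_use = {i: hex(v) for i, v in fences.items() if v}
    print("PCI ID:", hex(regs["PCI ID"]), "fences in use:", in_use)
```

In this capture EIR and PGTBL_ER are both zero and only fence 3 is in use, unlike the render error in comment 24, where EIR was 0x00000010 and PGTBL_ER was 0x00000001.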

Comment 27 Chris Wilson 2013-01-09 02:41:09 UTC

Created attachment 72695 [details] [review]
Longshot 1: remove g4x/g5 specific MI_FLUSH

Comment 28 Chris Wilson 2013-01-09 02:41:58 UTC

Created attachment 72696 [details] [review]
Longshot 2: make the shrinker less aggressive towards instruction bo

Comment 29 Jani Nikula 2013-01-09 08:21:40 UTC

Tom, please try Chris' patches.

Comment 30 Tom London 2013-01-09 14:15:52 UTC

Uhhhh, I only see one patch. Both posted patches are the same...

That right?

Comment 31 Daniel Vetter 2013-01-09 14:19:24 UTC

(In reply to comment #30)
> Uhhhh, I only see one patch. Both posted patches are the same...
> 
> That right?

Indeed. You could try the original "make shrinker less aggressive" patch though.

Comment 32 Chris Wilson 2013-01-09 14:40:58 UTC

Created attachment 72727 [details] [review]
Longshot 2: make the shrinker less aggressive towards instruction bo

It was almost 3am when I tried to upload the patches...

Comment 33 Tom London 2013-01-09 14:53:34 UTC

Having problems applying last patch:

+ patch -p1 -F1 -s
+ ApplyPatch 0002-make-the-shrinker-less-aggressive.patch
+ local patch=0002-make-the-shrinker-less-aggressive.patch
+ shift
+ '[' '!' -f /home/tbl/rpmbuild/SOURCES/0002-make-the-shrinker-less-aggressive.patch ']'
Patch33336: 0002-make-the-shrinker-less-aggressive.patch
+ case "$patch" in
+ patch -p1 -F1 -s
1 out of 2 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_gem.c.rej
error: Bad exit status from /var/tmp/rpm-tmp.lcmyBp (%prep)

Here is the .rej file:

--- drivers/gpu/drm/i915/i915_gem.c
+++ drivers/gpu/drm/i915/i915_gem.c
@@ -4470,11 +4515,8 @@
                unlock = false;
        }

-       if (nr_to_scan) {
-               nr_to_scan -= i915_gem_purge(dev_priv, nr_to_scan);
-               if (nr_to_scan > 0)
-                       i915_gem_shrink_all(dev_priv);
-       }
+       if (nr_to_scan)
+               i915_gem_shrink(dev_priv, nr_to_scan);

        cnt = 0;
        list_for_each_entry(obj, &dev_priv->mm.unbound_list, gtt_list)

Comment 34 Chris Wilson 2013-01-09 15:00:25 UTC

Was written against drm-intel-fixes, so should apply to 3.8-rc2 fine. Which kernel are you testing?

Comment 35 Tom London 2013-01-09 15:06:41 UTC

Sorry.  Was applying to 3.7.1.

Will grab source for kernel-3.8.0-0.rc2.git3.1.fc19.x86_64 and start again....

Comment 36 Tom London 2013-01-09 15:32:45 UTC

Am having problems building with the src rpm I pulled.

Will have to try again tonight.

Comment 37 Tom London 2013-01-10 04:37:31 UTC

I pulled kernel-3.8.0-0.rc2.git4.2.fc19.src.rpm from http://alt.fedoraproject.org/pub/alt/rawhide-kernel-nodebug/SRPMS/.

The patching now succeeds; I am building now....

Comment 38 Tom London 2013-01-10 14:34:17 UTC

Sorry for the delay, but the build took a few tries...

3.8.0-0.rc2.git4.2.local.fc19.x86_64 with the 2 above patches has been running my "cat 43GB-files >/dev/null" crasher for several minutes now without incident.

This is great!!!!!!

I haven't been able to complete this test, so this is quite a change.

No spew in dmesg either; stable gdm/Xorg/...

THANKS!!!!

I am now repeating this test. I will monitor and report.

More I can do?

Here is output from vmstat:

[tbl@tlondon VirtualMachines]$ vmstat 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  1      0 1775656 105060 1210000    0    0  3056   107  727 1203 13  5 37 44
 0  2      0 1042640 105072 1941068    0    0 73442    54 1071 2032  4  5 63 28
 0  2      0 267392 105092 2714712    0    0 77370     7  980 1905  3  5 52 40
 0  1      0 144492  31080 2931872    0    0 85133     4 1337 2093  4  8 62 26
 0  1      0 149396  31072 2930492    0    0 87949     0 1263 1910  3  9 64 24
 0  1      0 150632  31188 2928340    0    0 83795     7 1257 1946  3  7 55 35
 1  3      0 146360  31196 2938700    0    0 86184     2 1198 1886  2  6 47 45
 0  1      0 146504  31204 2943096    0    0 83787     3 1208 1929  3  7 52 38
 1  0      0 145768  31200 2948804    0    0 89929     2 1244 1924  3  8 60 29
 0  1      0 144688  31208 2953332    0    0 104573     1 1442 2143  4 10 60 26
 0  1      0 148116  31100 2952640    0    0 105757     1 1340 2044  3  9 56 32
 1  0      0 151436  31096 2952480    0    0 107600     0 1365 2096  3 10 65 22
 0  1      0 150376  31092 2953792    0    0 106744     0 1363 2067  3  9 55 33
 0  1      0 149860  31088 2955348    0    0 103801     0 1332 2097  3  8 52 37
 0  1      0 144076  31120 2961904    0    0 96686     1 1468 2063  3  9 56 31
 1  1      0 150732  31124 2953044    0    0 100161     0 1378 2082  4 10 62 25
 0  1      0 146408  31128 2958476    0    0 95870     0 1301 1984  3  9 58 30
 1  0      0 145520  31136 2959972    0    0 97358     0 1371 2060  3  8 52 37
 0  1      0 146996  31152 2958508    0    0 82962     5 1244 1951  3  7 52 38
 3  0      0 147996  31160 2954000    0    0 80578     2 1316 2006  3  8 60 29
 0  1      0 145532  31168 2957428    0    0 79778     0 1241 1897  3  7 58 32
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  1      0 127132  31320 2972564    0    0 73284  1197 1214 1921  5  6 48 41
 0  1      0 144656  31288 2951296    0    0 69968    13 1268 2141  5  8 53 35
 1  1      0 148924  31292 2946676    0    0 75920     4 1243 1975  3  8 61 28
 0  1      0 151308  31308 2939844    0    0 75828     1 1202 1903  3  7 57 33
 0  1      0 145636  31320 2946304    0    0 72241    16 1228 1913  3  7 55 35
 0  1      0 150740  31332 2941224    0    0 73810     3 1207 1948  3  8 58 31
 1  0      0 146628  31356 2944968    0    0 73763     0 1328 2007  3  8 61 28
 0  1      0 149832  31360 2941776    0    0 75132     2 1190 1905  3  7 63 27
 0  1      0 146020  31408 2915404    0    0 65193   194 1406 2064  5  6 43 46
 0  2      0 145768  31344 2915824    0    0 71813    11 1288 2074  3  6 50 41
 5  1      0 147704  31352 2913676    0    0 72329     6 1180 1835  3  6 49 42
 1  1      0 146844  31352 2913724    0    0 74676   166 1197 1876  3  7 60 30
 0  1      0 147560  31368 2900980    0    0 66823     4 1192 2036  4  6 50 40
 1  1      0 150736  31376 2897716    0    0 74338    17 1228 2133  3  6 47 44
 0  1      0 151344  31380 2894552    0    0 76119     0 1239 2108  3  7 51 39
 1  0      0 150808  31392 2895576    0    0 73536     1 1269 2170  3  7 56 34
 0  2      0 147624  31400 2898924    0    0 74670     2 1215 2090  3  6 57 33
 0  1      0 144880  31388 2901828    0    0 67518     6 1312 2129  4  8 50 38
 0  2      0 150752  31404 2897156    0    0 73924     2 1254 2135  3  6 50 41
 1  0      0 145128  31408 2903132    0    0 74100     1 1241 2140  4  7 58 32
 1  1      0 151144  31448 2894644    0    0 71395    30 1240 2135  3  6 51 39
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 2  1      0 148172  31464 2907324    0    0 71740     8 1203 2084  4  6 50 41
 0  2      0 151500  31480 2883140    0    0 62043    20 1552 2896  7  9 43 42
 0  1      0 148668  31476 2878704    0    0 67546   159 1329 2325  4  7 46 43
 0  1      0 149040  31524 2873000    0    0 66586    64 1298 2341  5  8 45 43
 5  0      0 124592  31472 2893828    0    0 41530    47 1734 2170  5 30 40 25
 0  1      0 145252  31488 2872904    0    0   276    60 1108 1916  7 12 79  2
 2  0      0 131864  31504 2881628    0    0    69    42 1270 2818 12  7 80  1
 0  0      0 141808  31616 2871964    0    0    65     2 1069 2359 11  7 82  0
 1  0      0 136560  31640 2871916    0    0    58    42 1307 2919 14  6 79  1
 0  0      0 111908  31664 2895860    0    0  1181  1198  905 1912 10  3 85  2
 0  0      0 120600  31688 2896620    0    0    77    52  736 1564  7  2 90  1
 0  1      0 117952  31704 2896928    0    0    29    39  678 1353  6  2 90  1
 2  0      0 113708  31720 2897192    0    0    26    25  777 1593  7  3 89  1
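
For anyone comparing runs, the vmstat output above can be summarized with a small script. This is only a sketch: `parse_vmstat` and `mean_column` are hypothetical helper names, not part of any tool mentioned in this bug, and the parser assumes the 16-column layout shown above with the banner and header lines repeated periodically.

```python
def parse_vmstat(text):
    """Parse `vmstat N` output into a list of dicts keyed by column name."""
    rows = []
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "procs":
            continue  # blank line or the "procs ---memory---" banner
        if fields[0] == "r":
            columns = fields  # column header, repeated periodically
            continue
        if columns and len(fields) == len(columns):
            rows.append(dict(zip(columns, map(int, fields))))
    return rows


def mean_column(rows, name):
    """Average one column (e.g. 'bi', blocks read in per second)."""
    return sum(r[name] for r in rows) / len(rows) if rows else 0.0


# First two samples copied from the output above.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  1      0 1775656 105060 1210000    0    0  3056   107  727 1203 13  5 37 44
 0  2      0 1042640 105072 1941068    0    0 73442    54 1071 2032  4  5 63 28
"""

if __name__ == "__main__":
    rows = parse_vmstat(SAMPLE)
    print(len(rows), mean_column(rows, "bi"))
```

Watching `free` fall while `cache` grows in the first few samples shows when the page cache fills up, which is presumably when the shrinker starts being called.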

Comment 39 Chris Wilson 2013-01-10 16:14:14 UTC

(In reply to comment #38)
> More I can do?

Can you run your test with just the first patch, https://bugs.freedesktop.org/attachment.cgi?id=72695 ? I'm interested in whether that path is worth pursuing, or if it is just a dead end.

Comment 40 Tom London 2013-01-10 16:23:02 UTC

OK. I just started a build with just one patch:  0002-remove-MI-FLUSH.patch

I've commented out 0002-make-the-shrinker-less-aggressive.patch

[Takes a while to build on my laptop.]

Will retest when complete and report back.

Comment 41 Daniel Vetter 2013-01-10 17:15:29 UTC

Everyone please retest with latest drm-intel-fixes from

http://cgit.freedesktop.org/~danvet/drm-intel

I've just merged a bunch of duct-tapes for this issue.

Comment 42 Tom London 2013-01-10 19:51:27 UTC

Created attachment 72801
tar.gz included 'dmesg', i915_error_state, etc.

The kernel with just the one patch (remove-MI_FLUSH) hangs/crashes under my "cat big-files >/dev/null" test.

Here is the dmesg spew:

[  178.704031] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  178.704039] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[  188.704040] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  188.757040] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00003000 tail 00000000 start 00003000
[  190.704040] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  190.704160] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
[  190.704163] [drm:i915_reset] *ERROR* Failed to reset chip.
[  201.972860] gnome-shell[1927]: segfault at 230 ip 00007fbcbf78989f sp 00007fff8ae7ac00 error 4 in i965_dri.so[7fbcbf737000+b3000]
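
A rough way to check a saved dmesg for this hang signature automatically is sketched below. The regular expression is written against the line format shown above; `find_drm_errors` and `is_wedged` are hypothetical names used only for illustration.

```python
import re

# Matches kernel log lines such as:
# [  178.704031] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
DRM_ERROR = re.compile(r"^\[\s*(\d+\.\d+)\] \[drm:(\w+)\] \*ERROR\* (.*)$")


def find_drm_errors(log_text):
    """Return (timestamp, function, message) for each drm *ERROR* line."""
    events = []
    for line in log_text.splitlines():
        m = DRM_ERROR.match(line.strip())
        if m:
            events.append((float(m.group(1)), m.group(2), m.group(3)))
    return events


def is_wedged(events):
    """True if the driver gave up on resetting the GPU."""
    return any("declaring wedged" in msg for _, _, msg in events)


# Three lines copied from the dmesg output above.
SAMPLE = """\
[  178.704031] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  178.704039] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[  190.704160] [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
"""

if __name__ == "__main__":
    events = find_drm_errors(SAMPLE)
    for ts, func, msg in events:
        print(f"{ts:10.3f} {func}: {msg}")
    print("wedged:", is_wedged(events))
```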

Here are the first few lines from i915_error_state:

Time: 1357847009 s 606144 us
PCI ID: 0x2a42
EIR: 0x00000000
IER: 0x02028c53
PGTBL_ER: 0x00000000
CCID: 0x00000000
  fence[0] = 00000000
  fence[1] = 00000000
  fence[2] = 00000000

Hope this helps....

Comment 43 Tom London 2013-01-12 02:02:55 UTC

I built kernel-3.8.0-0.rc3.git0.3.local.fc19.x86_64 with the one patch labelled 'drm-intel-fixes' in the above link.

I ran my crasher: "cat 40GB-files >/dev/null"; system stayed up, there was no spew in /var/log/messages, and system was stable.
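
For reference, the same kind of load can be sketched in Python. This is an approximation of `cat files >/dev/null`, not the exact test used here: `drain` is a hypothetical name and the 1 MiB chunk size is an arbitrary choice.

```python
import sys

CHUNK = 1 << 20  # read 1 MiB at a time; the size is an arbitrary choice


def drain(paths, chunk=CHUNK):
    """Read every file to the end and discard the data; return bytes read."""
    total = 0
    for path in paths:
        with open(path, "rb") as f:
            while True:
                data = f.read(chunk)
                if not data:
                    break
                total += len(data)
    return total


if __name__ == "__main__":
    # Usage: python drain.py big-file-1 big-file-2 ...
    print(drain(sys.argv[1:]))
```

The files need to be larger than RAM in total so that the page cache fills and memory reclaim kicks in.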

Thanks!

Here is output of 'vmstat 10' during the disk traffic surge:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  2      0 2332844  99788 604672    0    0  2262    73  834 1243 13  6 27 53
 0  2      0 1686000  99836 1268764    0    0 66922    56 1185 2209  7  4 26 63
 0  2      0 966036  99856 1987428    0    0 71821    30  868 1777  2  3 29 66
 0  2      0 204288  99936 2746552    0    0 75932   782  893 1774  2  3 30 65
 0  3      0 151216  29400 2871912    0    0 83392    42 1102 1820  2  5  7 87
 0  4      0 150760  29376 2873360    0    0 87950     0 1098 1875  2  4  0 94
 0  3      0 144736  31812 2880712    0    0 74862    54 1187 1954  2  4  0 94
 0  2      0 148012  33152 2878448    0    0 83268    80 1139 1918  2  4  9 84
 0  2      0 149312  32696 2877756    0    0 81625     2 1094 1831  2  4 33 61
 0  2      0 151232  32692 2878168    0    0 84898     0 1102 1894  2  4 33 61
 0  2      0 146352  32696 2885892    0    0 93866     1 1129 1894  2  4 33 61
 0  2      0 145628  31620 2895608    0    0 105160    76 1227 2056  2  5 34 59
 0  2      0 150792  30936 2889108    0    0 94103     2 1173 1934  2  5 34 59
 0  2      0 151032  30940 2889684    0    0 98985     2 1190 2011  2  5 33 59
 0  2      0 149740  30940 2892472    0    0 102569     0 1162 1978  2  5 35 58
 1  2      0 147280  31912 2889596    0    0 64857    12 1453 2815  7  8 19 67
 0  2      0 148320  31364 2878272    0    0 73233   180 1846 3628 11  9 16 64
 2  2      0 149648  31364 2890884    0    0 89531    15 1697 3587  9  9 27 56
 1  2      0 147136  31372 2893320    0    0 86727     0 1378 2484  4  8 29 59
 0  2      0 146868  31388 2892416    0    0 84028   152 1330 2456  4  6 29 60
 0  2      0 146492  31400 2892260    0    0 78779     5 1316 2442  4  5 30 61
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 3  2      0 149564  31236 2887672    0    0 78745    15 1454 2966  6  7 29 59
 0  2      0 149184  31196 2888136    0    0 74080    11 1466 2819  6  6 28 60
 0  2      0 147344  31196 2891156    0    0 77196     9 1241 2332  4  5 30 60
 0  2      0 147120  31208 2889356    0    0 73685    16 1183 2160  3  5 25 67
 0  3      0 150392  31236 2884580    0    0 69221    18 1200 2183  3  5 29 63
 1  2      0 149028  31248 2885824    0    0 75914     2 1272 2203  3  5 31 61
 0  2      0 146816  31264 2886108    0    0 70554    18 1342 2422  4  6 30 60
 0  2      0 146616  31296 2886848    0    0 72022  1202 1149 2014  4  5 28 63
 0  2      0 151444  31340 2881796    0    0 66190   257 1478 2575  9  7 22 61
 1  2      0 146656  31352 2885688    0    0 66033    27 1360 2499  4  6 30 60
 0  2      0 150356  31364 2879456    0    0 68884    33 1277 2383  4  6 28 62
 1  2      0 147388  31496 2878560    0    0 68414   164 1494 2739  7  8 25 60
 1  2      0 145040  31412 2883196    0    0 66066    21 1274 2281  3  5 30 62
 1  1      0 150928  31420 2875880    0    0 75336    20 1188 2089  4  5 32 59
 0  2      0 150160  31416 2853440    0    0 68478   167 1135 1920  4  5 29 62
 1  3      0 148228  31436 2852236    0    0 69452    21 1424 2743  5  7 27 60
 0  2      0 146688  31436 2852892    0    0 71651    12 1137 2001  3  6 25 66
 0  2      0 150668  31444 2847216    0    0 64634    15 1196 2013  3  5 20 73
 1  2      0 148196  31428 2849844    0    0 66958     0 1622 3516  8 10 27 56
 0  2      0 147388  31432 2850632    0    0 71242     4 1222 2113  4  6 29 61
 0  2      0 151420  31436 2845936    0    0 68421     0 1314 2233  3  8 30 59
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  3      0 149068  31436 2848756    0    0 70618     0 1455 2978  6  7 28 60
 1  2      0 146380  31436 2851932    0    0 67282     0 1125 1915  3  5 21 71
 0  2      0 145056  31440 2852548    0    0 71702     1 1078 1827  2  4 27 67
 0  2      0 148568  31484 2848428    0    0 58694    10 1374 2710  5  7 23 65
 0  2      0 147868  31488 2848808    0    0 69512     0 1068 1918  2  4 29 65
 0  2      0 146396  31488 2850440    0    0 68333    11 1021 1712  2  4 31 63
 0  2      0 147952  31500 2847680    0    0 64743     2 1219 2317  4  6 27 63
 0  2      0 151560  31508 2843916    0    0 65393     2 1194 2142  3  7 27 63
 1  1      0 139692  31484 2856432    0    0 24770     2 2246 2552  3 35 23 39
 0  1      0 122144  31484 2874040    0    0  2422     0  894 1770  3  2 59 37
 0  1      0 108100  31532 2888136    0    0  1413    23  677 1336  3  1 55 41
 0  1      0 149916  31540 2845932    0    0  1790     4  779 1592  3  2 54 41

Comment 44 Daniel Vetter 2013-01-14 17:36:39 UTC

Consolidating all gen4/5 i/o related hangs.

*** This bug has been marked as a duplicate of bug 55984 ***

Comment 45 Florian Mickler 2013-01-19 22:59:49 UTC

A patch referencing this bug report has been merged in Linux v3.8-rc4:

commit 93927ca52a55c23e0a6a305e7e9082e8411ac9fa
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Jan 10 18:03:00 2013 +0100

    drm/i915: Revert shrinker changes from "Track unbound pages"

Comment 46 Jari Tahvanainen 2016-10-07 05:30:43 UTC

Patch has been merged (long ago). Closing.
