Bug 57122 - [g4x regression] Graphics crash on Intel G41 under 3.7-rc3
Status: CLOSED DUPLICATE of bug 55984
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: highest major
Assignee: Imre Deak
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2012-11-14 16:41 UTC by Alex Villacís Lasso
Modified: 2016-10-07 05:33 UTC

See Also:
i915 platform:
i915 features:


Attachments
i915_error_state at time of crash, gzipped (202.25 KB, text/plain)
2012-11-14 16:41 UTC, Alex Villacís Lasso
dmesg output at time of crash, gzipped (21.21 KB, application/x-gzip)
2012-11-14 16:42 UTC, Alex Villacís Lasso
lspci output showing affected chipset (23.42 KB, text/plain)
2012-11-14 16:43 UTC, Alex Villacís Lasso
Xorg log at time of crash (65.41 KB, text/plain)
2012-11-14 16:44 UTC, Alex Villacís Lasso
disable unbound tracking (1.25 KB, patch)
2012-11-15 13:15 UTC, Daniel Vetter
disable cpu relocs completely (543 bytes, patch)
2012-11-16 18:24 UTC, Daniel Vetter
dmesg, second crash, 3.7-rc5, both test patches applied (69.22 KB, text/plain)
2012-11-19 16:21 UTC, Alex Villacís Lasso
i915_error_state, second crash, 3.7-rc5, both test patches applied (1.40 MB, text/plain)
2012-11-19 16:22 UTC, Alex Villacís Lasso
Kernel .config used to build failing kernels (121.02 KB, text/plain)
2012-11-19 17:10 UTC, Alex Villacís Lasso
3.7-rc .config (60.41 KB, text/plain)
2012-11-19 19:38 UTC, Andrew Clayton
i915_error_state, freeze with ickle/for-imre (1.48 MB, text/plain)
2012-11-21 19:10 UTC, Alex Villacís Lasso
dmesg, freeze with ickle/for-imre (69.58 KB, text/plain)
2012-11-21 19:11 UTC, Alex Villacís Lasso
Xorg log, freeze with ickle/for-imre (69.29 KB, text/plain)
2012-11-21 19:11 UTC, Alex Villacís Lasso
i915_error_state, freeze with ickle/bug55984 (1.48 MB, text/plain)
2012-11-23 19:36 UTC, Alex Villacís Lasso
dmesg, freeze with ickle/bug55984 (67.77 KB, text/plain)
2012-11-23 19:37 UTC, Alex Villacís Lasso
Xorg log, freeze with ickle/bug55984 (69.73 KB, text/plain)
2012-11-23 19:37 UTC, Alex Villacís Lasso
Don't force GTT/CPU relocations (7.18 KB, patch)
2012-11-26 09:54 UTC, Chris Wilson
dmesg, freeze with dont-force-gpu-relocations patch (77.02 KB, text/plain)
2012-12-03 19:10 UTC, Alex Villacís Lasso
Keep reserved objects pinned until after relocation processing. (2.85 KB, patch)
2012-12-13 10:55 UTC, Chris Wilson
Screenshot showing artifact (23.33 KB, image/png)
2012-12-17 19:31 UTC, Alex Villacís Lasso
dmesg with 3.7.0, CONFIG_PROVE_LOCKING=y (81.09 KB, text/plain)
2012-12-17 22:05 UTC, Alex Villacís Lasso
Debian-generated software info (82.71 KB, text/plain)
2012-12-19 03:38 UTC, Andreas Kloeckner
make the shrinker less aggressive (2.18 KB, patch)
2012-12-19 13:40 UTC, Daniel Vetter
Align surface sizes to an even tile row (839 bytes, patch)
2012-12-21 13:51 UTC, Chris Wilson
i915 error state from a non hung error state (191.25 KB, application/x-gzip)
2013-01-12 15:57 UTC, Andrew Clayton

Description Alex Villacís Lasso 2012-11-14 16:41:59 UTC
Created attachment 70081 [details]
i915_error_state at time of crash, gzipped

System is Fedora 16 x86_64, xorg-x11-server-Xorg-1.11.4-3.fc16.x86_64, xorg-x11-drv-intel-2.20.8-1.fc16.x86_64 gnome-shell-3.2.2.1-1.fc16.x86_64.

With distro-supplied kernel (kernel-3.6.6-1.fc16.x86_64), system works correctly. Also works correctly with self-compiled vanilla 3.6.0 kernel.

Since vanilla kernel 3.7-rc3 up to current 3.7-rc5, I have been experiencing random crashes of the graphic session. Always, the affected process is gnome-shell. I have currently no known way to induce the crash. The crash occurs randomly - it might happen a few minutes into the session, or it might not happen at all until I turn off the computer. Therefore it is hard for me to perform a bisection.

The crash only happens with my work computer (Intel G41 chipset). My home computer runs 3.7-rc5 x86_64 in the same Fedora 16 setup without incidents, but it is an Intel G31 as far as I can remember.
Comment 1 Alex Villacís Lasso 2012-11-14 16:42:47 UTC
Created attachment 70082 [details]
dmesg output at time of crash, gzipped
Comment 2 Alex Villacís Lasso 2012-11-14 16:43:39 UTC
Created attachment 70083 [details]
lspci output showing affected chipset
Comment 3 Alex Villacís Lasso 2012-11-14 16:44:06 UTC
Created attachment 70084 [details]
Xorg log at time of crash
Comment 4 Chris Wilson 2012-11-14 16:47:14 UTC
Smells like bug 56916.
Comment 5 Andrew Clayton 2012-11-14 21:43:44 UTC
Just transferring my information from the ML to here.

I saw this just the once recently, on 3.7-rc4+. The system was idle and the screen was blanked. I could get out to a virtual console, but had to reboot to get 3D working again (2D graphics seemed to be working OK).

Nov 11 17:36:02 omega kernel: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Nov 11 17:36:02 omega kernel: [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00003000 tail 00000000 start 00003000
Nov 11 17:36:03 omega kernel: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Nov 11 17:36:03 omega kernel: [drm:i915_reset] *ERROR* GPU hanging too fast, declaring wedged!
Nov 11 17:36:03 omega kernel: [drm:i915_reset] *ERROR* Failed to reset chip.

And it was also gnome-shell which was involved.

Nov 11 17:36:14 omega kernel: gnome-shell[15559]: segfault at 0 ip 00007fa0ac65b695 sp 00007fffddca2cd0 error 4 in i965_dri.so[7fa0ac5f1000+3bf000]

Unfortunately I neglected to get the i915_error_state

This is also on a x86_64 Fedora 16 system with a G41

00:02.0 VGA compatible controller: Intel Corporation 4 Series Chipset Integrated Graphics Controller (rev 03)

Now running 3.7-rc5 and have not seen this happen again. I did see this kind of thing after a while during Fedora 14 (running kernel.org kernels), but it went away after moving to Fedora 16, until this latest oddity.

Here's the mesa/libdrm/intel driver versions.

mesa-libGL-7.11.2-3.fc16.x86_64
mesa-libGLU-devel-7.11.2-3.fc16.x86_64
mesa-libGL-devel-7.11.2-3.fc16.x86_64
mesa-libGLU-7.11.2-3.fc16.x86_64
mesa-dri-drivers-7.11.2-3.fc16.x86_64
mesa-dri-filesystem-7.11.2-3.fc16.x86_64

libdrm-2.4.33-1.fc16.x86_64

xorg-x11-drv-intel-2.20.8-1.fc16.x86_64
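
Since the error state lives only in kernel memory and is gone after a reboot, it helps to save it as soon as a hang like the one quoted above is logged. What follows is a minimal sketch, not a tested tool: it assumes debugfs is mounted at /sys/kernel/debug and that the GPU is DRM minor 0, and the function names are purely illustrative.

```shell
#!/bin/sh
# Assumed debugfs location of the error state (DRM minor 0); override if needed.
ERROR_STATE="${ERROR_STATE:-/sys/kernel/debug/dri/0/i915_error_state}"

# Succeeds if the given log line is one of the hang messages quoted above.
is_hang_line() {
    case "$1" in
        *"Hangcheck timer elapsed"*|*"declaring wedged"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Copy the error state to a gzipped, timestamped file in directory $1
# and print the name of the file that was written.
save_error_state() {
    out="$1/i915_error_state.$(date +%Y%m%d-%H%M%S).gz"
    gzip -c "$ERROR_STATE" > "$out" && echo "$out"
}
```

Reading debugfs normally requires root, so something like `save_error_state /root` would have to be run as root before rebooting; `is_hang_line` could be used to filter dmesg or syslog lines and trigger the copy.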
Comment 6 Daniel Vetter 2012-11-15 13:15:52 UTC
Created attachment 70112 [details] [review]
disable unbound tracking

Silly me just noticed that the unbound tracking has been merged into 3.7, not 3.6. This has a big enough impact to explain all kinds of things. Please try the attached patch, thanks.
Comment 7 Alex Villacís Lasso 2012-11-15 15:38:38 UTC
Would this explain why the crash happens on my G41 at work but not on my G33 at home?
Comment 8 Chris Wilson 2012-11-15 15:40:50 UTC
No, they both utilize unbound pages (if you have the same kernel). The only significant difference will be mesa and the use of reloc-trees.
Comment 9 Daniel Vetter 2012-11-16 18:24:42 UTC
Created attachment 70170 [details] [review]
disable cpu relocs completely

I'm not completely sure, but I think we haven't ruled this one out yet. Please test, thanks
Comment 10 Alex Villacís Lasso 2012-11-16 20:03:46 UTC
Is this second patch supposed to be applied in addition to the first one, or instead of it? For now, I will assume it should be applied on top of the first one.
Comment 11 Daniel Vetter 2012-11-16 20:12:33 UTC
(In reply to comment #10)
> Is this second patch supposed to be applied in addition to the first one, or
> instead of it? For now, I will assume it should be applied on top of the
> first one.

At the moment we're lacking a clue about what's going on, so these are just a bunch of test patches. You can test them all at once; if it works, we can figure out which one fixed things.
Comment 12 Alex Villacís Lasso 2012-11-19 16:18:21 UTC
My graphics session just crashed again in the exact same way as before. I was running 3.7-rc5 and it crashed in the middle of compiling 3.7-rc6.
Comment 13 Alex Villacís Lasso 2012-11-19 16:18:44 UTC
Forgot to mention: I had both test patches applied on top of -rc5.
Comment 14 Alex Villacís Lasso 2012-11-19 16:21:10 UTC
Created attachment 70265 [details]
dmesg, second crash, 3.7-rc5, both test patches applied
Comment 15 Alex Villacís Lasso 2012-11-19 16:22:19 UTC
Created attachment 70267 [details]
i915_error_state, second crash, 3.7-rc5, both test patches applied
Comment 16 Daniel Vetter 2012-11-19 16:28:52 UTC
Ok, yet another new theory ... please attach your kernel .config, thanks.
Comment 17 Alex Villacís Lasso 2012-11-19 17:10:38 UTC
Created attachment 70269 [details]
Kernel .config used to build failing kernels
Comment 18 Alex Villacís Lasso 2012-11-19 17:34:24 UTC
A possibly helpful thing: in my home machine (where the crash does not occur) a sample run of glxgears opens /usr/lib64/dri/i915_dri.so . In my work machine (where the crash occurs), glxgears opens /usr/lib64/dri/i965_dri.so .
Comment 19 Andrew Clayton 2012-11-19 19:38:47 UTC
Created attachment 70276 [details]
3.7-rc .config

I saw this again on Sunday with 3.7-rc6. Attached my .config and for what it's worth, my machine here is also loading i965_dri.so for glxgears
Comment 20 Chris Wilson 2012-11-20 15:19:44 UTC
Alex, do you mind giving the tree at http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=for-imre a spin:

$ cd /path/to/linux
$ git remote add ickle -f git://people.freedesktop.org/~ickle/linux-2.6
$ git checkout ickle/for-imre

make; install; test
Comment 21 Alex Villacís Lasso 2012-11-20 16:46:33 UTC
Ok, will do. For reference:

[alex@avillacis linux-git]$ git checkout ickle/for-imre
Checking out files: 100% (1025/1025), done.
Note: checking out 'ickle/for-imre'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b new_branch_name

HEAD is now at ab5c8df... drm/i915: Preallocate next seqno before touching the ring
Comment 22 Chris Wilson 2012-11-20 16:59:46 UTC
I was going to suggest naming it bug57122 (git checkout -b bug57122 ickle/for-imre), but that branch is going to be pretty volatile and you only want it for smoketesting...
Comment 23 Alex Villacís Lasso 2012-11-20 19:19:24 UTC
I am now running the requested branch. However, I am again experiencing the "random stalls in graphics applications" issue that I reported in the mailing list back when I tested master 3.7-rc2 . Here is what I reported at that time:

--------start quote--------
I am testing linux-3.7-rc2 in Fedora 16 x86_64 in a workstation at my day job. My kernel configuration is attached. My graphics chipset shows up in lspci as follows:

00:02.0 VGA compatible controller [0300]: Intel Corporation 4 Series Chipset Integrated Graphics Controller [8086:2e32] (rev 03) (prog-if 00 [VGA controller])
    Subsystem: Intel Corporation Device [8086:d612]
    Flags: bus master, fast devsel, latency 0, IRQ 43
    Memory at d0000000 (64-bit, non-prefetchable) [size=4M]
    Memory at c0000000 (64-bit, prefetchable) [size=256M]
    I/O ports at f140 [size=8]
    Expansion ROM at <unassigned> [disabled]
    Capabilities: [90] MSI: Enable+ Count=1/1 Maskable- 64bit-
    Capabilities: [d0] Power Management version 2
    Kernel driver in use: i915
    Kernel modules: i915

All the tests were made with the default KMS enabled, as setup by the distro.

With the distro-supplied linux-3.6.2-1, I have no problems at all with graphics. Likewise with vanilla kernel 3.6.0.

With 3.7-rc1 onwards, and also 3.7-rc2, my workstation seems to boot normally and I can login into Gnome Shell. However, after a while, some random graphical client with which I am interacting stops responding. This stall is of random length - sometimes it lasts a fraction of a second, or any interval up to a few minutes. It seems that anything that causes the app to try to draw to the screen might cause the stall, even something as simple as switching to the app. So far, I have seen stalls while using the following apps: firefox, thunderbird, eclipse, gnome-terminal, and even gnome-shell itself. When gnome-shell is affected, the entire desktop freezes, and becomes unusable. However, in all cases (even gnome-shell stalls), the mouse cursor can be moved, and I can switch into other consoles with Ctrl-Alt-Fn very easily. When I switch to a console, I can run top, and it shows me that the stalled application is apparently burning CPU time at 99%, but the corresponding CPU is busy in "system" time, not "user" time.

Interestingly, the xserver process itself has never been seen burning CPU when these stalls happen.

I have tried killing the stalled gnome-shell with "kill -9 PID", but it proved unkillable even by this. However I then restarted the X server with Ctrl-Alt-Backspace, and this managed to terminate the same unkillable gnome-shell.

I have also attached gdb to the stalled processes. However, the symbol loading goes by at a snail's pace, which is unusual. After that I managed to issue bt on two processes. The results are attached. In both backtraces, the innermost function is writev initiated by XPutImage.

I have seen nothing unusual for me in the dmesg output (attached) even with a stalled process running.

It seems that this problem is unique to my workstation. My home machine has a different Intel chipset (G31 if I remember correctly) but also runs Fedora 16 x86_64 with 3.7-rc2, and has never been affected by this issue.
--------end quote--------
Comment 24 Alex Villacís Lasso 2012-11-20 19:32:46 UTC
Back when I was testing master 3.7-rc2, I was asked to post /proc/PID/stack for the affected process. This is what I get:

[alex@avillacis ~]$ cat /proc/2535/stack 
[<ffffffffffffffff>] 0xffffffffffffffff
Comment 25 Alex Villacís Lasso 2012-11-20 20:10:12 UTC
In the test kernel, sometimes I catch the stalling process with this stack:

[<ffffffff816422e6>] retint_kernel+0x26/0x30
[<ffffffff81149feb>] shrink_page_list+0x68b/0xa00
[<ffffffff8114a8cf>] shrink_inactive_list+0x18f/0x450
[<ffffffff8114b318>] shrink_lruvec+0x448/0x560
[<ffffffff8114b4a5>] shrink_zone+0x75/0xa0
[<ffffffff8114b63b>] zone_reclaim+0x16b/0x270
[<ffffffff8113f6c1>] get_page_from_freelist+0x511/0x740
[<ffffffff8113fa98>] __alloc_pages_nodemask+0x1a8/0x9f0
[<ffffffff8117e6a3>] alloc_pages_vma+0xb3/0x190
[<ffffffff81160e39>] handle_pte_fault+0x709/0xab0
[<ffffffff81162469>] handle_mm_fault+0x269/0x340
[<ffffffff8164540c>] __do_page_fault+0x16c/0x5a0
[<ffffffff8164584e>] do_page_fault+0xe/0x10
[<ffffffff81642488>] page_fault+0x28/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
Comment 26 Chris Wilson 2012-11-20 20:18:03 UTC
(In reply to comment #25)
> In the test kernel, sometimes I catch the stalling process with this stack:
> 
> [<ffffffff816422e6>] retint_kernel+0x26/0x30
> [<ffffffff81149feb>] shrink_page_list+0x68b/0xa00
> [<ffffffff8114a8cf>] shrink_inactive_list+0x18f/0x450
> [<ffffffff8114b318>] shrink_lruvec+0x448/0x560
> [<ffffffff8114b4a5>] shrink_zone+0x75/0xa0
> [<ffffffff8114b63b>] zone_reclaim+0x16b/0x270
> [<ffffffff8113f6c1>] get_page_from_freelist+0x511/0x740
> [<ffffffff8113fa98>] __alloc_pages_nodemask+0x1a8/0x9f0
> [<ffffffff8117e6a3>] alloc_pages_vma+0xb3/0x190
> [<ffffffff81160e39>] handle_pte_fault+0x709/0xab0
> [<ffffffff81162469>] handle_mm_fault+0x269/0x340
> [<ffffffff8164540c>] __do_page_fault+0x16c/0x5a0
> [<ffffffff8164584e>] do_page_fault+0xe/0x10
> [<ffffffff81642488>] page_fault+0x28/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff

There's a direct-reclaim bug that matches this description in that kernel that is yet to be resolved upstream. Can you please keep testing and comparing the stacks of when it is stalled (or try sudo perf top) and see if it is always the same (or at least similar)?
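
For comparing stacks across several stalls, a small sampler can tally which kernel function the stalled process is in most often. This is only a sketch under the assumption that /proc/PID/stack lines have the format shown in the trace above; the function names are made up for illustration, and reading another process's stack usually requires root.

```shell
#!/bin/sh
# Print the function name from one /proc/PID/stack line, e.g.
# "[<ffffffff8114b63b>] zone_reclaim+0x16b/0x270" -> "zone_reclaim".
# Lines without a symbol (such as the 0xffffffffffffffff terminator) print nothing.
frame_name() {
    printf '%s\n' "$1" | sed -n 's/^\[<[0-9a-f]*>] \([^+]*\)+.*$/\1/p'
}

# Print the innermost named frame of a stack dump read from stdin.
innermost_frame() {
    while IFS= read -r line; do
        name=$(frame_name "$line")
        if [ -n "$name" ]; then
            echo "$name"
            return 0
        fi
    done
    return 1
}

# Sample /proc/$1/stack $2 times, one second apart, and count how often
# each innermost frame was seen (most frequent first).
sample_stacks() {
    pid="$1"; count="$2"; i=0
    while [ "$i" -lt "$count" ]; do
        innermost_frame < "/proc/$pid/stack"
        i=$((i + 1))
        sleep 1
    done | sort | uniq -c | sort -rn
}
```

For example, `sample_stacks 2535 30` would sample the process from comment 24 for about thirty seconds. If the same reclaim-related frames dominate every time, that supports the direct-reclaim theory; `sudo perf top`, as suggested above, gives a system-wide view of the same thing.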
Comment 27 Alex Villacís Lasso 2012-11-20 21:24:00 UTC
The thing is, this random-stall situation disappeared under 3.7-rc3. Hmmm... the same kernel when I first noticed the graphics crash.
Comment 28 Alex Villacís Lasso 2012-11-20 22:25:37 UTC
Sorry, I cannot keep testing ickle/for-imre because the frequent stalls make the session essentially unusable. Please remember this is my day job machine.
Comment 29 Chris Wilson 2012-11-21 12:22:08 UTC
So linus/master seems better behaved, so I pushed the merged branch to ickle/for-imre. Alex, if you feel brave... Thanks.
Comment 30 Alex Villacís Lasso 2012-11-21 19:09:50 UTC
After an hour of testing the merge of ickle/for-imre, the graphics session froze, but this time it did not segfault gnome-shell. The screen just froze when I was doing the gesture of moving the mouse pointer to the top-left corner in order to dismiss the gnome-shell window expose. Through a remote ssh connection, I was able to collect some debugging information.
Comment 31 Alex Villacís Lasso 2012-11-21 19:10:38 UTC
Created attachment 70387 [details]
i915_error_state, freeze with ickle/for-imre
Comment 32 Alex Villacís Lasso 2012-11-21 19:11:03 UTC
Created attachment 70388 [details]
dmesg, freeze with ickle/for-imre
Comment 33 Alex Villacís Lasso 2012-11-21 19:11:33 UTC
Created attachment 70389 [details]
Xorg log, freeze with ickle/for-imre
Comment 34 Alex Villacís Lasso 2012-11-21 19:12:16 UTC
Forgot to mention. The merged ickle/for-imre did not exhibit any random stalls at all before the freeze.
Comment 35 Chris Wilson 2012-11-21 22:05:45 UTC
Well the good news is that is a completely different bug. Again it should be impossible...
Comment 36 Chris Wilson 2012-11-22 08:46:07 UTC
I've put a smaller selection of patches in ickle/bug55984. It's still a shotgun approach, but a good first step will be to see if it cures the hang..
Comment 37 Alex Villacís Lasso 2012-11-23 19:35:51 UTC
The ickle/bug55984 branch still hangs on me after about one hour of use. I will post the debugging files again for this hang.
Comment 38 Alex Villacís Lasso 2012-11-23 19:36:33 UTC
Created attachment 70484 [details]
i915_error_state, freeze with ickle/bug55984
Comment 39 Alex Villacís Lasso 2012-11-23 19:37:08 UTC
Created attachment 70485 [details]
dmesg, freeze with ickle/bug55984
Comment 40 Alex Villacís Lasso 2012-11-23 19:37:28 UTC
Created attachment 70486 [details]
Xorg log, freeze with ickle/bug55984
Comment 41 Chris Wilson 2012-11-26 09:54:13 UTC
Created attachment 70578 [details] [review]
Don't force GTT/CPU relocations

Today's patch, please test.
Comment 42 Alex Villacís Lasso 2012-11-26 15:39:11 UTC
Unable to apply cleanly:

[alex@avillacis linux-ickle-bug55984]$ patch -p1 --dry-run  < ../0001-drm-i915-Avoid-forcing-relocations-through-the-mappa.patch
patching file drivers/gpu/drm/i915/i915_gem_execbuffer.c
Hunk #1 succeeded at 37 with fuzz 2 (offset 4 lines).
Hunk #2 FAILED at 98.
Hunk #3 FAILED at 205.
Hunk #4 FAILED at 231.
Hunk #5 FAILED at 335.
Hunk #6 FAILED at 352.
Hunk #7 FAILED at 424.
Hunk #8 FAILED at 435.
Hunk #9 FAILED at 467.
Hunk #10 FAILED at 476.
Hunk #11 FAILED at 659.
10 out of 11 hunks FAILED -- saving rejects to file drivers/gpu/drm/i915/i915_gem_execbuffer.c.rej

I am trying to apply this on top of ickle/bug55984 .
Comment 43 Chris Wilson 2012-11-26 15:47:55 UTC
Sorry, I wasn't clear in my intentions. Can you please apply this to 3.7-rc7 or drm-intel-fixes?
Comment 44 Alex Villacís Lasso 2012-11-26 16:01:59 UTC
Ok. Applied on top of vanilla 3.7-rc7 with previous test patches removed.
Comment 45 Alex Villacís Lasso 2012-12-03 16:02:18 UTC
(In reply to comment #44)
> Ok. Applied on top of vanilla 3.7-rc7 with previous test patches removed.

Sorry for late response. I suddenly realized that I had VirtualBox running. So I had to remove it as a possible factor on the crash. I tested 3.7-rc7 *without*  "Don't force GTT/CPU relocations" patch and with KVM/QEMU instead of VirtualBox for my virtual machines, and it ran correctly until today, in which the graphics display hung in the exact same way as last time. Now I will test the patch, again without VirtualBox.
Comment 46 Alex Villacís Lasso 2012-12-03 19:10:51 UTC
Created attachment 70976 [details]
dmesg, freeze with dont-force-gpu-relocations patch

Once again, my session crashed with the dont-force-gpu-relocations patch applied. I was running a KVM/QEMU virtual machine, and scrolling down on a page in firefox. Furthermore, I was unable to capture the i915_error_state file, because both cat and cp complained "out of memory" when trying to read the report. There are some backtraces on the dmesg output. Do they give some clue?
Comment 47 Chris Wilson 2012-12-04 16:20:17 UTC
One thing that I have asked elsewhere on a similar bug is to see if you can reproduce the failure with SNA and attach that error-state. If it does reoccur, due to the different layout of the batchbuffer, we can get a fair amount of auxiliary data which may yield a clue.
Comment 48 Alex Villacís Lasso 2012-12-04 22:49:24 UTC
Sorry, I am not familiar at all with "SNA". What is it? What do I have to do to use it?
Comment 49 Chris Wilson 2012-12-04 22:52:36 UTC
Add

Section "Device"
  Identifier "Device0"
  Driver "intel"
  Option "AccelMethod" "sna"
EndSection

to your xorg.conf (or as a snippet in xorg.conf.d).
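
Because the testing in this bug switches between SNA and UXA several times, a helper that writes the snippet can save some typing. A minimal sketch: the function name and the target file name are hypothetical, and it uses the `Identifier` keyword, which is what xorg.conf(5) specifies for naming a Device section.

```shell
#!/bin/sh
# Write a Device section selecting AccelMethod $1 ("sna" or "uxa") to file $2.
# The identifier string "Device0" is the one used in the comment above.
write_accel_conf() {
    method="$1"; dest="$2"
    case "$method" in
        sna|uxa) ;;
        *) echo "unknown AccelMethod: $method" >&2; return 1 ;;
    esac
    cat > "$dest" <<EOF
Section "Device"
  Identifier "Device0"
  Driver "intel"
  Option "AccelMethod" "$method"
EndSection
EOF
}
```

Run as root, `write_accel_conf sna /etc/X11/xorg.conf.d/20-intel.conf` (the file name is an assumption) followed by a restart of the X server would select SNA; passing `uxa` switches back.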
Comment 50 Alex Villacís Lasso 2012-12-11 16:13:25 UTC
I have run 3.7.0-rc8 with SNA until today without being able to trigger the graphics crash. I am being careful to run QEMU/KVM instead of VirtualBox in order to avoid introducing taint in the kernel. However, I notice some graphic artifacts, such as the middle button of the windows (the one that maximises the window) being blank with the background color, but drawing itself correctly when I hover the mouse over it. I will now test just-released 3.7.0 *without* SNA to see if there is any change.
Comment 51 Chris Wilson 2012-12-11 16:19:15 UTC
(In reply to comment #50)
> However, I notice some
> graphic artifacts, such as the middle button of the windows (the one that
> maximises the window) being blank with the background color, but drawing
> itself correctly when I hover the mouse over it. 

The GPU is buggy and I'm trying to find a workaround that doesn't kill performance...
Comment 52 Alex Villacís Lasso 2012-12-11 21:57:19 UTC
(In reply to comment #50)
> I have run 3.7.0-rc8 with SNA until today without being able to trigger the
> graphics crash. I am being careful to run QEMU/KVM instead of VirtualBox in
> order to avoid introducing taint in the kernel. However, I notice some
> graphic artifacts, such as the middle button of the windows (the one that
> maximises the window) being blank with the background color, but drawing
> itself correctly when I hover the mouse over it. I will now test
> just-released 3.7.0 *without* SNA to see if there is any change.

Unfortunately 3.7.0 still exhibits the same graphics crash without SNA - it crashed after a few hours of use. I am now using SNA just so that I have a stable system.
Comment 53 Chris Wilson 2012-12-13 10:55:23 UTC
Created attachment 71440 [details] [review]
Keep reserved objects pinned until after relocation processing.

An idea. It should be impossible...
Comment 54 Alex Villacís Lasso 2012-12-13 15:41:44 UTC
(In reply to comment #53)
> Created attachment 71440 [details] [review]
> Keep reserved objects pinned until after relocation processing.
> 
> An idea. It should be impossible...

Applied on top of 3.7. Will start testing shortly, without SNA.
Comment 55 Alex Villacís Lasso 2012-12-13 17:38:59 UTC
Bad luck. The crash still happens after applying the patch. Again, I was unable to capture i915_error_state due to "out of memory" errors.
Comment 56 Alex Villacís Lasso 2012-12-17 15:17:30 UTC
BTW, the issue of graphic artifacts when using SNA under 3.7 should be considered a regression. The 3.6.7-4.fc16.x86_64 distro-supplied kernel does not exhibit said artifacts with SNA.
Comment 57 Chris Wilson 2012-12-17 15:20:15 UTC
(In reply to comment #56)
> BTW, the issue of graphic artifacts when using SNA under 3.7 should be
> considered a regression. The 3.6.7-4.fc16.x86_64 distro-supplied kernel does
> not exhibit said artifacts with SNA.

Ok, then it is not the artifacts I'm aware of (I guess). Can you please try to grab photo or screenshot?
Comment 58 Alex Villacís Lasso 2012-12-17 19:31:26 UTC
Created attachment 71682 [details]
Screenshot showing artifact

This is the artifact I am seeing most frequently with SNA (all the time, all the windows). The middle button of the window decoration (gnome-shell) is supposed to show the maximize icon, but instead is blank. This does not occur with the 3.6.x kernel, or with the default (crashing) UXA acceleration.
Comment 59 Alex Villacís Lasso 2012-12-17 19:34:28 UTC
Is there any monitoring I could perform in the background under 3.7 that will provide information on the root cause of the crash, *before* said crash happens?
Comment 60 Chris Wilson 2012-12-17 19:45:09 UTC
(In reply to comment #58)
> Created attachment 71682 [details]
> Screenshot showing artifact
> 
> This is the artifact I am seen most frequently with SNA (all the time, all
> the windows). The middle button of the window decoration (gnome-shell) is
> supposed to show the maximize icon, but instead is blank. This does not
> occur with the 3.6.x kernel, or with the default (crashing) UXA acceleration.

I could have sworn that was the CompositeTrapezoids Damage bug (lack of Damage notification sent). But that should also be the case with 3.6. It looks like it should be a small inline trapezoid, in which case it will be uploaded through an async buffer (either snooped or GTT, depending upon state and kernel). Daniel, can you remember when set_caching finally landed? That might indeed be 3.7.


(In reply to comment #59)
> Is there any monitoring I could perform in the background under 3.7 that
> will provide information on the root cause of the crash, *before* said crash
> happens?

Well, you've debunked my best ideas so far. I'm convinced that the key difference is in the relocation-*tree* used by UXA. But not yet sure how the bug is manifesting itself. In that scenario the most likely culprit is that we reuse stale relocation entries believing that they are valid. Perhaps if we always forced the relocations?
Comment 61 Alex Villacís Lasso 2012-12-17 22:05:50 UTC
Created attachment 71702 [details]
dmesg with 3.7.0, CONFIG_PROVE_LOCKING=y

In an attempt to get some information, I recompiled the kernel with CONFIG_PROVE_LOCKING=y, and added slub_debug=FZPU to the kernel command line. Then, I set the acceleration back to UXA, and left a slabinfo -v running every 5 seconds in the background. After that I started a KVM virtual machine, and some time after that, I got a lock ordering warning in the attached dmesg. Does this shed some light on the graphics issue, or is this a completely separate bug? If it is not related, where should I report this?
Comment 62 Daniel Vetter 2012-12-17 23:28:57 UTC
Locksplat of zcache vs. pagecache afaict. I'd suggest to send that thing to the linux-kernel mailing list, cc fs-devel directly. Shouldn't be related to the gfx issue at hand here.
Comment 63 Daniel Vetter 2012-12-18 10:56:42 UTC
Please try out the patch at

https://patchwork.kernel.org/patch/1885411/

It has a decent chance of reducing GTT thrashing, which might be good enough to duct-tape over the hangs again. Or it may change the pattern enough to reproduce the hang much more quickly. In any case, it should be interesting ...
Comment 64 Alex Villacís Lasso 2012-12-18 18:43:24 UTC
I tried the patch at https://patchwork.kernel.org/patch/1885411/ . After a few hours of use, the system failed, but in a different way. All of a sudden, the graphical desktop became unresponsive. No mouse movement, no keyboard response, keyboard LEDs could not be toggled, Ctrl-Alt-Backspace did not work. I ssh'd into the machine, and "top" showed the Xorg process at 99% system time in one CPU. All attempts to kill Xorg failed, even with kill -9. All attempts to attach to the process with gdb hung. A controlled reboot via ssh also hung, so I had to hard-reset the machine. No error state was collected in i915_error_state, and there was no DRI-related backtrace in the error log.
Comment 65 Alex Villacís Lasso 2012-12-18 18:44:41 UTC
(In reply to comment #64)
> I tried the patch at https://patchwork.kernel.org/patch/1885411/ . After a
> few hours of use, the system failed, but in a different way. All of a
> sudden, the graphical desktop became unresponsive. No mouse movement, no
> keyboard response, keyboard leds could not be toggled, Ctrl-Alt-Backspace
> did not work. I sshd into the machine, and "top" showed the Xorg process at
> 99% system time in one CPU. All attempts to kill Xorg failed, even with kill
> -9. All attempts to attach to the process with gdb hung. A controlled reboot
> via ssh also hung, so I had to hard-reset the machine. No error state was
> collected in i915_error_state, and there was no DRI-related backtrace in the
> error log.

BTW, this was with UXA acceleration, not SNA.
Comment 66 Chris Wilson 2012-12-18 20:26:33 UTC
(In reply to comment #64)
> I tried the patch at https://patchwork.kernel.org/patch/1885411/
[snip]
>I sshd into the machine, and "top" showed the Xorg process at
> 99% system time in one CPU.

Daniel, that's the bug I thought was elsewhere. Basically we evict something to make room, but then fail to find a hole. I thought it was my create top-down that was broken, but there be dragons.
Comment 67 Andreas Kloeckner 2012-12-19 03:38:09 UTC
Created attachment 71780 [details]
Debian-generated software info

FWIW, I'm seeing the same symptoms (gnome-shell hangs, mouse movable, killall -9 gnome-shell revives) with 2.20.14 on

Linux ding 3.5-trunk-amd64 #1 SMP Debian 3.5.5-1~experimental.1 x86_64 GNU/Linux

with Mesa 8.0.5. I believe this started when I upgraded libdrm, the intel DDX and Mesa a while back. Kernel stayed constant.
Comment 68 Daniel Vetter 2012-12-19 13:40:55 UTC
Created attachment 71806 [details] [review]
make the shrinker less aggressive

Duct-tape solution if it is one, but imo very much worth a try.
Comment 69 Alex Villacís Lasso 2012-12-19 15:43:20 UTC
(In reply to comment #68)
> Created attachment 71806 [details] [review]
> make the shrinker less aggressive
> 
> Duct-tape solution if it is one, but imo very much worth a try.

Applying on top of vanilla 3.7.0 and  https://patchwork.kernel.org/patch/1885411/ .
Comment 70 Alex Villacís Lasso 2012-12-19 17:36:34 UTC
No luck. Both patches together still result in Xorg spinning and eating all the CPU in system mode after two hours of normal use. I had to hard-reset the machine again.
Comment 71 Daniel Vetter 2012-12-20 09:52:04 UTC
(In reply to comment #70)
> No luck. Both patches together still result in Xorg spinning and eating all
> the CPU in system mode after two hours of normal use. I had to hard-reset
> the machine again.

Please test again with only the "make shrinker less aggressive" patch. The former patch seems to be broken somehow, and my patch doesn't try to fix that, so the same "X stuck spinning" symptoms are still expected when both are applied.
Comment 72 Chris Wilson 2012-12-21 13:51:28 UTC
Created attachment 71932 [details] [review]
Align surface sizes to an even tile row
Comment 73 Alex Villacís Lasso 2012-12-27 15:02:05 UTC
(In reply to comment #71)
> (In reply to comment #70)
> > No luck. Both patches together still result in Xorg spinning and eating all
> > the CPU in system mode after two hours of normal use. I had to hard-reset
> > the machine again.
> 
> Please test again with only the "make shrinker less aggressive" patch. The
> former patch seems to be broken somehow, and my patch doesn't try to fix
> that, so the same "X stuck spinning" symptoms are still expected.

Running 3.7.0 with "make shrinker less aggressive" patch only. So far, two days without graphics issues. Seems good, but I will keep testing. In a prior test, the machine lasted a week before a graphics crash.
Comment 74 Chris Wilson 2012-12-30 10:39:07 UTC
xf86-video-intel commit 736b89504a32239a0c7dfb5961c1b8292dd744bd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Dec 30 10:32:18 2012 +0000

    uxa: Align surface allocations to even tile rows
    
    Align surface sizes to an even number of tile rows to cater for sampler
    prefetch. If we read beyond the last page we may catch the PTE in a
    state of flux and trigger a GPU hang. Also detected by enabling invalid
    PTE access checking.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=56916
    References: https://bugs.freedesktop.org/show_bug.cgi?id=55984
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
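
The padding described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `round_up()` and `align_height_even_tile_rows()` are hypothetical helpers, and the tile heights mentioned in the comments (8 rows for X tiling, 32 for Y tiling) are assumptions about the hardware rather than values taken from this report.

```c
#include <stdint.h>

/* Hypothetical helper: round value up to the next multiple of align
 * (align must be non-zero). */
static uint32_t round_up(uint32_t value, uint32_t align)
{
    return (value + align - 1) / align * align;
}

/* Pad a surface height, in pixel rows, to an even number of tile rows.
 * tile_height is the number of pixel rows in one tile row, assumed here
 * to be 8 for X tiling and 32 for Y tiling. The extra rows are never
 * drawn to; they only ensure that a prefetch past the last used row
 * still lands inside the allocation. */
static uint32_t align_height_even_tile_rows(uint32_t height,
                                            uint32_t tile_height)
{
    return round_up(height, 2 * tile_height);
}
```

Under these assumptions, a 1080-row X-tiled surface spans 135 tile rows, an odd number, and would be padded to 1088 rows (136 tile rows).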
Comment 75 Alex Villacís Lasso 2013-01-03 14:43:43 UTC
My machine is still running stable after one more day with the less-aggressive-shrinker kernel patch. Meanwhile, 3.8-rc2 is out. Should I try to upgrade to this kernel version? Should I apply the less-aggressive-shrinker kernel patch on this kernel too?

BTW, I am not currently testing any of the xf86-video-intel patches. I am still using the distro-supplied version xorg-x11-drv-intel-2.20.8-1.fc16.x86_64 .
Comment 76 Chris Wilson 2013-01-03 16:14:18 UTC
(In reply to comment #75)
> My machine is still running stable after one more day with the
> less-aggressive-shrinker kernel patch. Meanwhile, 3.8-rc2 is out. Should I
> try to upgrade to this kernel version? Should I apply the
> less-aggressive-shrinker kernel patch on this kernel too?

There is no patch for this issue upstream yet, and the less-aggressive-shrinker patch is not the frontrunner among the potential workaround patches. (The fixed version of #63 is a better choice, since it actually fixes a real bug and has the side effect of also reducing the likelihood of triggering this hang.)

> BTW, I am not currently testing any of the xf86-video-intel patches. I am
> still using the distro-supplied version
> xorg-x11-drv-intel-2.20.8-1.fc16.x86_64 .

This is most likely to be the root cause of the problem.
Comment 77 Alex Villacís Lasso 2013-01-03 17:13:34 UTC
So I should be trying an updated xorg-intel driver on top of an *unpatched* kernel? I thought that, since the distro-supplied kernel works fine with the distro-supplied xorg-intel driver, and the failure occurs only if I swap the kernel, it must therefore be a kernel bug.
Comment 78 Alex Villacís Lasso 2013-01-03 17:47:31 UTC
I have successfully compiled xf86-video-intel at fc702cdf534a4694a64408428e8933497a7fc06e and it appears to run correctly under the patched 3.7.0 kernel. I will now compile unpatched 3.8-rc2 and see what happens.
Comment 79 Alex Villacís Lasso 2013-01-04 18:11:48 UTC
Bad luck. The unpatched 3.8-rc2 kernel crashed on me just a moment ago, even with the updated xorg-intel driver. Same symptoms as before.
Comment 80 Chris Wilson 2013-01-04 18:36:08 UTC
Thanks, that is useful to know. Still at a loss to explain this, except that we know it has to do with surface evictions and the processing of the relocation tree.
Comment 81 Daniel Vetter 2013-01-10 17:14:52 UTC
Everyone please retest with latest drm-intel-fixes from

http://cgit.freedesktop.org/~danvet/drm-intel

I've just merged a bunch of duct-tapes for this issue.
Comment 82 Andrew Clayton 2013-01-12 15:54:17 UTC
Been testing 3.8.0-rc3-00074-gb719f43 under an up to date 64bit Fedora 16, with a gen4 G41

00:02.0 VGA compatible controller: Intel Corporation 4 Series Chipset Integrated Graphics Controller (rev 03)

So far the bug seems to be sufficiently hidden again ;)

However at some point I have had this.

[drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
i915: render error detected, EIR: 0x00000010
i915:   IPEIR: 0x00000000
i915:   IPEHR: 0x01000000
i915:   INSTDONE_0: 0xfffffffe
i915:   INSTDONE_1: 0xffffffff
i915:   INSTDONE_2: 0x00000000
i915:   INSTDONE_3: 0x00000000
i915:   INSTPS: 0x0001e000
i915:   ACTHD: 0xd2c08eb8
i915: page table error
i915:   PGTBL_ER: 0x00000002
[drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking

Though I've not noticed any ill effects yet.

Cheers,
Andrew
Comment 83 Andrew Clayton 2013-01-12 15:57:02 UTC
Created attachment 72905 [details]
i915 error state from a non hung error state

Attached the i915_error_state to go along with my previous comment in case it's useful
Comment 84 Daniel Vetter 2013-01-14 17:35:53 UTC
Consolidating all gen4/5 i/o related hangs.

*** This bug has been marked as a duplicate of bug 55984 ***
Comment 85 Florian Mickler 2013-01-19 23:01:15 UTC
A patch referencing this bug report has been merged in Linux v3.8-rc4:

commit 93927ca52a55c23e0a6a305e7e9082e8411ac9fa
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Jan 10 18:03:00 2013 +0100

    drm/i915: Revert shrinker changes from "Track unbound pages"
Comment 86 Jari Tahvanainen 2016-10-07 05:33:56 UTC
Patch merged, closing.

