Bug 43427

Summary:

[G33] bad tiling when visiting some pages in firefox

Product:

DRI

Reporter:

Jiri Slaby <jirislaby>

Component:

DRM/Intel

Assignee:

Daniel Vetter <daniel>

Status:

CLOSED FIXED

QA Contact:

Severity:

normal

Priority:

medium

CC:

ben, chris, daniel, freedesktop-bugzilla, jbarnes, kan.liang

Version:

XOrg git

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Whiteboard:

i915 platform:

i915 features:

Bug Depends on:

Bug Blocks:

42991, 44622

Attachments:

Description	Flags
xorg.log	none
full dmesg	none
intel_reg_dumper output	none
xorg with ddx 2.17	none
i915_error_state after "Bad Tiling" error	none
xrandr output for details of the display config	none
xrandr output for details of the display config (this time as text/plain, sorry)	none
example corruption linked from the lkml discussion	none
Different kind of finish_gpu patch	none
error state from today	none
error state from today	none
Maintain fenced gpu access until flushed	none
915_error_state with "Maintain fenced gpu access until flushed" only	none
915_error_state with both patches	none
Mark untiled BLT commands as fenced	none
xorg.log with 2.18 (UXA)	none

Description Jiri Slaby 2011-12-01 08:40:10 UTC

Created attachment 54019 [details]
xorg.log

Originally reported at:
https://lkml.org/lkml/2011/12/1/208

> - Can you check whether upgrading the ddx

Sorry, what is ddx?

> - If you're using swap, can you check whether disabling it works around
>  the issue?

No, I have no swap.

Comment 1 Jiri Slaby 2011-12-01 08:41:16 UTC

Created attachment 54020 [details]
full dmesg

Comment 2 Jiri Slaby 2011-12-01 08:41:47 UTC

Created attachment 54021 [details]
intel_reg_dumper output

Comment 3 Daniel Vetter 2011-12-01 08:48:26 UTC

ddx = X driver = xf86-video-intel, 2.17 is the latest release.

Comment 4 Jiri Slaby 2011-12-01 08:57:00 UTC

Created attachment 54023 [details]
xorg with ddx 2.17

Comment 5 Jiri Slaby 2011-12-01 08:58:34 UTC

It still happens with 2.17. Do you need reg dump with the driver?

Comment 6 Michael Karcher 2011-12-04 14:03:55 UTC

Created attachment 54115 [details]
i915_error_state after "Bad Tiling" error

I received a "bad tiling" crash on a Thinkpad T60, with a 945GM chipset, while trying to toy around with the i915 gallium driver.

00:02.0 VGA compatible controller [0300]: Intel Corporation Mobile 945GM/GMS, 943/940GML Express Integrated Graphics Controller [8086:27a2] (rev 03)

My memory configuration is dual channel asymmetric (2GB + 1GB), currently using an external monitor at 1920x1080 connected via DVI. The kernel messages are

[drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
render error detected, EIR: 0x00000010
page table error
  PGTBL_ER: 0x00000040
[drm:i915_report_and_clear_eir] *ERROR* EIR stuck: 0x00000010, masking
render error detected, EIR: 0x00000010
page table error
  PGTBL_ER: 0x00000040
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 19397591 at 19397588, next 19397602)
[drm:i915_reset] *ERROR* Failed to reset chip.

After that error, I observed minor redraw issues in the GUI, probably related to the crashed GPU, but X11 is still mostly working, which is kind-of surprising to me (I am using xfce4 without compiz at the moment, so no 3D/compositing should be active). The new attachment shows the i915_error_state output.
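
A minimal sketch for reading the i915_wait_request line in the kernel messages above. It assumes "awaiting" is the sequence number the caller waits for and "at" is the last one the GPU reported as completed; the helper name parse_wait_request is made up for illustration. On Linux, return value -11 corresponds to -EAGAIN.

```python
import re

# Pattern for the i915_wait_request message as quoted in this report.
WAIT_RE = re.compile(
    r"i915_wait_request returns (-?\d+) "
    r"\(awaiting (\d+) at (\d+), next (\d+)\)"
)


def parse_wait_request(line):
    """Return (ret, awaiting, at, next) from a dmesg line, or None."""
    m = WAIT_RE.search(line)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())


line = ("[drm:i915_wait_request] *ERROR* i915_wait_request returns -11 "
        "(awaiting 19397591 at 19397588, next 19397602)")
ret, awaiting, at, nxt = parse_wait_request(line)

# Assumed meaning: how far the awaited request is ahead of the last
# sequence number the GPU reported before it stopped making progress.
behind = awaiting - at
print(ret, behind)
```

Read this way, the GPU stopped only a few requests short of the one being waited for, which fits a hang rather than a slow queue.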

Comment 7 Michael Karcher 2011-12-04 14:10:21 UTC

Created attachment 54116 [details]
xrandr output for details of the display config

Comment 8 Michael Karcher 2011-12-04 14:19:29 UTC

Created attachment 54117 [details]
xrandr output for details of the display config (this time as text/plain, sorry)

Comment 9 Michael Karcher 2011-12-04 14:57:28 UTC

Accompanied with the GPU crash is the X server disabling acceleration (which explains why X still works):

[ 28347.679] [mi] EQ overflowing. The server is probably stuck in an infinite loop.
[ 28347.679] 
Backtrace:
[ 28347.830] 0: /usr/bin/Xorg (xorg_backtrace+0x26) [0x7f160d4e58f6]
[ 28347.830] 1: /usr/bin/Xorg (mieqEnqueue+0x191) [0x7f160d4c6201]
[ 28347.830] 2: /usr/bin/Xorg (0x7f160d361000+0x65224) [0x7f160d3c6224]
[ 28347.830] 3: /usr/bin/Xorg (xf86PostMotionEventP+0x4a) [0x7f160d400b4a]
[...]
[ 28347.831] 22: /usr/bin/Xorg (0x7f160d361000+0x414ad) [0x7f160d3a24ad]
[ 28348.024] (EE) intel(0): Detected a hung GPU, disabling acceleration.
[ 28348.024] (EE) intel(0): When reporting this, please include i915_error_state from debugfs and the full dmesg.

(dmesg and i915_error_state are already quoted)

Comment 10 Jiri Slaby 2012-01-21 05:46:42 UTC

I still see the invalid tiles, should I try to update ddx (and depending libdrm)?

Note that the two other reports seem to be completely different. My GPU does not get stuck. It just misrenders e.g. some map tiles when browsing mapy.cz.

Comment 11 Daniel Vetter 2012-01-21 06:52:15 UTC

Michael Karcher, can you please file a separate bug for your issue?

Comment 12 Daniel Vetter 2012-01-21 06:53:50 UTC

Created attachment 55910 [details]
example corruption linked from the lkml discussion

Comment 13 Jiri Slaby 2012-02-07 07:52:31 UTC

I bisected the driver. It led me to this commit from 2.13:
commit cc930a37612341a1f2457adb339523c215879d82
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Nov 14 19:47:00 2010 +0000

    uxa: Relax fencing some more for gen3


If I revert that on the top of 2.17.0, it works fine. So far.

Comment 14 Jiri Slaby 2012-02-13 02:10:27 UTC

(In reply to comment #13)
> If I revert that on the top of 2.17.0, it works fine. So far.

Heh, this really fixed all the symptoms:
* bad tiles in maps
* the 2 GPU hangs a week I was encountering
* video stops playing in kaffeine when browsing maps or doing other gfx intensive work

Comment 15 Chris Wilson 2012-02-13 03:00:22 UTC

Jiri, what GPU hangs? You haven't attached any example i915_error_states? There is one particularly nasty bug, which has been reported to cause tiling corruption, and is likely to be the culprit here. Turning off relaxed-fencing is just likely to lower the reuse rate of bo, increase aperture thrashing and just make that exact path harder to hit (in particular it will eliminate the reuse of render targets as batch buffers which is the crux of the hangs). But if my guess is correct, and there is so far no evidence to suggest otherwise ;-), then it won't eliminate the risk of that hang entirely.

Comment 16 Jiri Slaby 2012-02-13 03:06:49 UTC

(In reply to comment #15)
> Jiri, what GPU hangs?

Oh, I thought there was a link. Apparently not. Here it is:
https://lkml.org/lkml/2012/1/24/165

> You haven't attached any example i915_error_states? There is one particularly nasty bug, which has been reported to cause tiling
> corruption, and is likely to be the culprit here.

Do you mean "drm/i915: Only clear the GPU domains upon a successful finish"? Without that patch I see daily GPU hangs. With that patch this was reduced to 2 hangs per week.

> Turning off relaxed-fencing is just likely to lower the reuse rate of bo, increase aperture thrashing and
> just make that exact path harder to hit (in particular it will eliminate the reuse of render targets as batch buffers which is the crux of the hangs). But if
> my guess is correct, and there is so far no evidence to suggest otherwise ;-), then it won't eliminate the risk of that hang entirely.

Ok, if you have any ideas what to test, let me know. The revert, as a workaround, allows me to work at least :).

Comment 17 Chris Wilson 2012-02-13 09:24:23 UTC

Can you please attach an example of the current hangs with the finish-gpu patch applied? I'm hoping that they follow a different pattern and are either a userspace driver bug, or might shed light on the use-after-free bug that we have been theorizing exists in the kernel. Or it could be much more mundane.

Comment 18 Daniel Vetter 2012-02-13 09:47:50 UTC

Created attachment 56981 [details] [review]
Different kind of finish_gpu patch

Can you also try this patch _instead_ of the finish_gpu one you're currently using? If this also ends up in a gpu hang, please attach the error_state.

Comment 19 Jiri Slaby 2012-02-14 01:33:18 UTC

(In reply to comment #17)
> Can you please attach an example of the current hangs with the finish-gpu patch applied? I'm hoping that they follow a different pattern and are either a
> userspace driver bug, or might shed light on the use-after-free bug that we have been theorizing exists in the kernel. Or it could be much more mundane.

There is a link to one in the lkml post:
http://www.fi.muni.cz/~xslaby/sklad/panics/915_error_state

This *is* with the patch applied.

Comment 20 Jiri Slaby 2012-02-14 01:39:08 UTC

(In reply to comment #18)
> Created attachment 56981 [details] [review] [review]
> Different kind of finish_gpu patch
> 
> Can you also try this patch _instead_ of the finish_gpu one you're currently using? If this also ends up in a gpu hang, please attach the error_state.

This also causes bad tiles. I will report in a few days whether it causes GPU hangs as well (they are not so easy to reproduce).

Comment 21 Chris Wilson 2012-02-14 02:31:00 UTC

The GPU state looks internally consistent, I haven't spotted the error that is causing it to hang. Which is why I want more error states to see if the pattern is the same, or to see if the problem becomes more apparent.

Comment 22 Jiri Slaby 2012-02-15 15:35:53 UTC

Created attachment 57123 [details]
error state from today

Comment 23 Jiri Slaby 2012-02-15 15:40:05 UTC

(In reply to comment #21)
> The GPU state looks internally consistent, I haven't spotted the error that is causing it to hang. Which is why I want more error states to see if the pattern
> is the same, or to see if the problem becomes more apparent.

Ok, I attached one that happened a few minutes ago. This is with the patch from comment #18. Do you want more with "drm/i915: Only clear the GPU domains upon a successful finish", or doesn't it matter which patch is applied?

And yes, misrendered tiles are definitely bound to these GPU hangs. I saw more and more such tiles on the maps so I tried the usual trigger of a GPU hang -- open iGoogle page in firefox. And voilà, it indeed hung ;).

Comment 24 Jiri Slaby 2012-02-15 15:46:41 UTC

And if it is important dmesg said for the state in comment #22:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 6822006 at 6821999, next 6822007)
[drm:i915_reset] *ERROR* Failed to reset chip.

Comment 25 Chris Wilson 2012-02-15 16:02:08 UTC

Here we go:

buffer: 0b000000     8192 0006 0000 00681876 P X dirty render uncached (fence: 8)

fence[8] = 0b000001
  valid, x-tiled, pitch: 512, start: 0x03000000, size: 104857

but used consistently within the batch as

0x0a200b80:      0x54300004: XY_COLOR_BLT (rgb enabled, alpha enabled, src tile 0, dst tile 0)
0x0a200b84:      0x03f000c0:    format 8888, pitch 192, rop 0xf0, clipping disabled,
0x0a200b88:      0x00000000:    (0,0)
0x0a200b8c:      0x00250028:    (40,37)
0x0a200b90:      0x0b000000:    offset 0x0b000000
0x0a200b94:      0x00000000:    color

or

0x0a200a34:      0x7d8e0001: 3DSTATE_BUFFER_INFO
0x0a200a38:      0x03000040:    color, tiling = none, pitch=64
0x0a200a3c:      0x0b000000:    address

i.e. as an untiled temporary render target.

So it looks like this is entirely a ddx vs kernel confusion. The ddx believes that it has an untiled buffer, but the kernel is insistent that it never received the command to clear the tiling.
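
A minimal sketch of the comparison described above: decode the XY_COLOR_BLT dwords from the batch and set them against what the fence line says. The bit positions used here (opcode in bits 22-28, destination-tiled flag in bit 11, pitch in the low 16 bits of the second dword, rop in bits 16-23) are assumptions chosen to reproduce the decoder output quoted above, not a substitute for the hardware documentation. The function and dictionary names are hypothetical.

```python
def decode_xy_color_blt(dwords):
    """Decode the six dwords of the XY_COLOR_BLT quoted in the error state."""
    header, dw1, _top_left, bottom_right, offset, _color = dwords
    return {
        "opcode": (header >> 22) & 0x7F,         # assumed opcode field
        "dst_tiled": bool(header & (1 << 11)),   # assumed "dst tile" bit
        "pitch": dw1 & 0xFFFF,                   # bytes per row
        "rop": (dw1 >> 16) & 0xFF,               # raster operation
        "x2": bottom_right & 0xFFFF,
        "y2": bottom_right >> 16,
        "offset": offset,
    }


# Dwords copied from the batch dump at 0x0a200b80.
batch = [0x54300004, 0x03F000C0, 0x00000000,
         0x00250028, 0x0B000000, 0x00000000]
blt = decode_xy_color_blt(batch)

# Kernel's view of the same buffer, copied from the fence[8] line.
fence = {"tiled": True, "pitch": 512}

# The ddx programs an untiled blit while the fence still says X-tiled.
mismatch = (blt["dst_tiled"] != fence["tiled"]
            or blt["pitch"] != fence["pitch"])
print(blt, mismatch)
```

The decoded values agree with the dump (pitch 192, rop 0xf0, bottom-right corner (40,37)), and the mismatch flag is set because the batch treats the buffer as untiled while the fence covers it as X-tiled with pitch 512.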

Comment 26 Jiri Slaby 2012-02-17 13:51:22 UTC

Created attachment 57231 [details]
error state from today

Today, another one. I suppose you don't need more of them (I switched back to the driver with the workaround)?

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 389582 at 389576, next 389583)
[drm:i915_reset] *ERROR* Failed to reset chip.

Comment 27 Chris Wilson 2012-02-19 04:11:41 UTC

Ok, that error state confirms the pattern. It is dying on a BLT command that conflicts with the fence registers.

Comment 28 Daniel Vetter 2012-02-20 04:04:29 UTC

I've looked again at the example corruption image and some wrong fencing (it looks like broken stride with the corruption nicely aligned to X-tiled tiles) looks most plausible.

Comment 29 Jiri Slaby 2012-02-21 03:09:33 UTC

Today I hit the following warning in the kernel. Probably after a GPU hang:
WARN_ON(dev_priv->fence_regs[obj->fence_reg].pin_count);

Comment 30 Daniel Vetter 2012-02-21 04:19:16 UTC

Can you please attach the entire backtrace?
On 21.02.2012 12:09, <bugzilla-daemon@freedesktop.org> wrote:

> https://bugs.freedesktop.org/show_bug.cgi?id=43427
>
> --- Comment #29 from Jiri Slaby <jirislaby@gmail.com> 2012-02-21 03:09:33
> PST ---
> Today I hit the following warning in the kernel. Probably after a GPU hang:
> WARN_ON(dev_priv->fence_regs[obj->fence_reg].pin_count);

Comment 31 Jiri Slaby 2012-02-21 04:34:59 UTC

(In reply to comment #30)
> Can you please attach the entire backtrace?

Unfortunately no, because it was not logged :( (and I expected it to be). So I remembered only the line. And after reboot I looked into the code and pasted the WARN here.

Comment 32 Chris Wilson 2012-02-21 10:05:29 UTC

Created attachment 57409 [details] [review]
Maintain fenced gpu access until flushed

Hmm, once upon a time I thought this was a required bug fix. So probably it still is relevant.

Comment 33 Jiri Slaby 2012-02-22 00:28:48 UTC

(In reply to comment #32)
> Created attachment 57409 [details] [review] [review]
> Maintain fenced gpu access until flushed

I suppose I should apply that instead of the patch from comment #18 and in companion with "drm/i915: Only clear the GPU domains upon a successful finish", right?

Comment 34 Daniel Vetter 2012-02-22 00:32:51 UTC

> --- Comment #33 from Jiri Slaby <jirislaby@gmail.com> 2012-02-22 00:28:48 PST ---
> (In reply to comment #32)
>> Created attachment 57409 [details] [review] [review]
>> Maintain fenced gpu access until flushed
>
> I suppose I should apply that instead of the patch from comment #18 and in
> companion with "drm/i915: Only clear the GPU domains upon a successful finish",
> right?

I think both this patch alone and this patch + "drm/i915: Only clear
the GPU domains upon a successful finish" are interesting
combinations, so please try both of them.

Comment 35 Jiri Slaby 2012-02-27 04:24:55 UTC

Created attachment 57711 [details]
915_error_state with "Maintain fenced gpu access until flushed" only

(In reply to comment #34)
> I think both this patch alone

This means a death within hours. An error state attached. Dmesg:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 286724 at 286718, next 286726)
[drm:i915_reset] *ERROR* Failed to reset chip.

> and this patch + "drm/i915: Only clear
> the GPU domains upon a successful finish" are interesting
> combinations

Now running a kernel with both of them; bad tiles in maps are still there. If this leads to a GPU hang, I will report later.

Comment 36 Jiri Slaby 2012-02-29 12:52:39 UTC

Created attachment 57827 [details]
915_error_state with both patches

(In reply to comment #35)
> Now running a kernel with both of them, bad tiles in maps are still there. If
> this leads to a GPU hang, I will report l8r.

Yes, with both patches applied, I still get GPU hangs like this:
[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... GPU hung
[drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[drm:i915_wait_request] *ERROR* i915_wait_request returns -11 (awaiting 2215296 at 2215286, next 2215297)
[drm:i915_reset] *ERROR* Failed to reset chip.

Comment 37 Jiri Slaby 2012-03-01 11:48:53 UTC

(In reply to comment #30)
> Can you please attach the entire backtrace?

Here you are:

WARNING: at drivers/gpu/drm/i915/i915_gem.c:2368 i915_gem_object_put_fence+0xbd/0xd0()
Hardware name: To Be Filled By O.E.M.
Modules linked in: pl2303 usbserial microcode
Pid: 4287, comm: Xorg Not tainted 3.3.0-rc5-next-20120227_64+ #1655
Call Trace:
 [<ffffffff81065b6a>] warn_slowpath_common+0x7a/0xb0
 [<ffffffff81065bb5>] warn_slowpath_null+0x15/0x20
 [<ffffffff8134c73d>] i915_gem_object_put_fence+0xbd/0xd0
 [<ffffffff8134dbef>] i915_gem_object_unbind+0x7f/0x1b0
 [<ffffffff8134dd3a>] i915_gem_free_object_tail+0x1a/0xd0
 [<ffffffff81350651>] i915_gem_free_object+0x51/0x60
 [<ffffffff813261d5>] drm_gem_object_free+0x25/0x40
 [<ffffffff81359e18>] intel_user_framebuffer_destroy+0x68/0x70
 [<ffffffff813343a3>] drm_fb_release+0x83/0xb0
 [<ffffffff81325e58>] drm_release+0x5d8/0x6d0
 [<ffffffff81121372>] fput+0xe2/0x250
 [<ffffffff8111dd21>] filp_close+0x61/0x90
 [<ffffffff81069270>] put_files_struct+0x80/0xe0
 [<ffffffff81069375>] exit_files+0x45/0x50
 [<ffffffff81069d53>] do_exit+0x683/0x900
 [<ffffffff8113c09f>] ? mntput+0x1f/0x30
 [<ffffffff81121439>] ? fput+0x1a9/0x250
 [<ffffffff8163bb14>] ? __schedule+0x294/0x670
 [<ffffffff8106a30f>] do_group_exit+0x3f/0xb0
 [<ffffffff8106a392>] sys_exit_group+0x12/0x20
 [<ffffffff8163d7a2>] system_call_fastpath+0x16/0x1b

Comment 38 Chris Wilson 2012-03-20 07:41:00 UTC

I think I may have stumbled upon something...

I've put some patches up at http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=for-jiri of particular interest is http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=for-jiri&id=79710e6ccabdac80c65cd13b944695ecc3e42a9d

The problem that I spotted is that a batch with an unfenced BLT command is not marked with fenced_gpu_access which means that we think we can modify the fence whilst that command is in flight. obj->fenced_gpu_access |= obj->pending_fenced_gpu_access I think was a partial solution to that problem without. So the key change in that patch is

@@ -494,12 +493,12 @@ pin_and_fence_object(struct drm_i915_gem_object *obj,
 				entry->flags |= __EXEC_OBJECT_HAS_FENCE;
 				i915_gem_object_pin_fence(obj);
 			} else {
-				ret = i915_gem_object_put_fence(obj);
+				ret = i915_gem_object_put_fence(obj, ring);
 				if (ret)
 					goto err_unpin;
 			}
+			obj->pending_fenced_gpu_access = true;
 		}
-		obj->pending_fenced_gpu_access = need_fence;
 	}

(with some supporting chunks required, the rest were trying to make pipelined fencing happy.)
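
To make the effect of the hunk above easier to follow, here is a toy model in Python. It is not kernel code. It assumes that before the patch need_fence was only true for tiled buffers (which is how the removed line reads next to the description of the problem), and all class and function names are hypothetical.

```python
class Buffer:
    def __init__(self, tiled):
        self.tiled = tiled
        self.pending_fenced_gpu_access = False


def reserve_before_patch(buf, needs_fence_flag):
    # Assumed old rule: only a tiled buffer counts as fenced GPU access.
    need_fence = needs_fence_flag and buf.tiled
    buf.pending_fenced_gpu_access = need_fence


def reserve_after_patch(buf, needs_fence_flag):
    # Rule from the hunk: any buffer flagged as needing a fence is marked,
    # tiled or not; buffers without the flag are left unchanged.
    if needs_fence_flag:
        buf.pending_fenced_gpu_access = True


def fence_may_change(buf):
    # Toy stand-in for "the kernel thinks it can modify the fence now".
    return not buf.pending_fenced_gpu_access


untiled_blt_target = Buffer(tiled=False)

reserve_before_patch(untiled_blt_target, needs_fence_flag=True)
unsafe = fence_may_change(untiled_blt_target)    # fence could move mid-BLT

reserve_after_patch(untiled_blt_target, needs_fence_flag=True)
safe = not fence_may_change(untiled_blt_target)  # fence is now held

print(unsafe, safe)
```

In this model the untiled BLT target is unprotected under the old rule and protected under the new one, which is the behaviour change the patch is after.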

Comment 39 Chris Wilson 2012-03-20 07:55:04 UTC

Created attachment 58757 [details] [review]
Mark untiled BLT commands as fenced

Comment 40 Jiri Slaby 2012-03-20 10:09:56 UTC

(In reply to comment #38)
> I think I may have stumbled upon something...

Bad tiles in maps are gone with any of:
- with your kernel and ddx 2.17
- with 3.3.0-rc7-next-20120319 and ddx 2.18

I believe the GPU hangs are connected to bad tiles in maps. So I would say it is fixed.

And I would say something in between 2.17 and 2.18 made the problem harder to reproduce. (Or fixed it differently.) Because the problem is gone with unpatched kernel, but with 2.18 used.

Comment 41 Chris Wilson 2012-03-20 10:21:39 UTC

Jiri, do you mind attaching the Xorg.log from 2.17.0 and 2.18.0? From the other bug, it seems SNA was switched on for 2.18.0 and I want to confirm that, and that 2.17.0 is UXA. (From other reports, SNA is a lot more resilient to tiling corruption than UXA. The only significant difference there would be the buffer management resulting in different usage patterns, I guess.)

Comment 42 Jiri Slaby 2012-03-20 13:32:44 UTC

Created attachment 58782 [details]
xorg.log with 2.18 (UXA)

(In reply to comment #41)
> Jiri, do you mind attaching the Xorg.log from 2.17.0 and 2.18.0?

So this is 2.18 compiled from git (which doesn't crash -- bug 47597); without SNA support. With my -next kernel, I cannot reproduce here.

There is already an xorg log for 2.17 as attachment 54023 [details]. Do you need a fresh one?

Comment 43 Chris Wilson 2012-03-20 14:30:28 UTC

No, just trying to identify the commit that likely changed the behaviour from 2.17 with UXA (crash/tiling corruption) to 2.18 with UXA (stable).

Comment 44 Chris Wilson 2012-03-21 11:20:49 UTC

*** Bug 47398 has been marked as a duplicate of this bug. ***

Comment 45 Jiri Slaby 2012-03-22 03:20:51 UTC

(In reply to comment #43)
> No, just trying to identify the commit that likely changed the behaviour from
> 2.17 with UXA (crash/tiling corruption) to 2.18 with UXA (stable).

Scratch that. Today I got the tiling problem with 2.18+UXA+unpatched_kernel.

Comment 46 Daniel Vetter 2012-03-31 08:22:42 UTC

Just to check: Has this tiling issue ever showed up with the "Mark untiled BLT commands as fenced" kernel patch?

Comment 47 Jiri Slaby 2012-03-31 08:23:54 UTC

(In reply to comment #46)
> Just to check: Has this tiling issue ever showed up with the "Mark untiled BLT
> commands as fenced" kernel patch?

No, I haven't seen it since then.

Comment 48 Chris Wilson 2012-04-03 14:39:50 UTC

commit 7dd4906586274f3945f2aeaaa5a33b451c3b4bba
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Mar 21 10:48:18 2012 +0000

    drm/i915: Mark untiled BLT commands as fenced on gen2/3
    
    The BLT commands on gen2/3 utilize the fence registers and so we cannot
    modify any fences for the object whilst those commands are in flight.
    Currently we marked tiled commands as occupying a fence, but forgot to
    restrict the untiled commands from preventing a fence being assigned
    before they were completed.
    
    One side-effect is that we then have to double check that a fence was
    allocated for a fenced buffer during move-to-active.
    
    Reported-by: Jiri Slaby <jirislaby@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=43427
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=47990
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Testcase: i-g-t/tests/gem_tiled_after_untiled_blt
    Tested-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@kernel.org
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

Comment 49 Jiri Slaby 2012-04-10 10:54:12 UTC

(In reply to comment #48)
> commit 7dd4906586274f3945f2aeaaa5a33b451c3b4bba
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Wed Mar 21 10:48:18 2012 +0000
> 
>     drm/i915: Mark untiled BLT commands as fenced on gen2/3

Bad news. This version of the patch causes a regression during resume. It looks like the console is not switched back to X. I still see the kernel messages.

If I revert 7dd49065862 and apply
http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=for-jiri&id=79710e6ccabdac80c65cd13b944695ecc3e42a9d

instead, it works.

Bisection log if you care. (Crap is that something around 3.4-rc1 does not boot here so this was not easy to find :P.)

git bisect start '--' 'drivers/gpu/drm/'
# good: [c16fa4f2ad19908a47c63d8fa436a1178438c7e7] Linux 3.3
git bisect good c16fa4f2ad19908a47c63d8fa436a1178438c7e7
# bad: [0034102808e0dbbf3a2394b82b1bb40b5778de9e] Linux 3.4-rc2
git bisect bad 0034102808e0dbbf3a2394b82b1bb40b5778de9e
# good: [c57ebf5ef3588d21031f12e39131d79071269845] drm/nv50/pm: wait for all fifo-connected engines to idle before reclocking
git bisect good c57ebf5ef3588d21031f12e39131d79071269845
# good: [43b3cd995f304c983393b7ed6563f09781bc41d0] drm/radeon/kms: add initial DCE6 display watermark support
git bisect good 43b3cd995f304c983393b7ed6563f09781bc41d0
# skip: [09fa30226130652af75152d9010c603c66d46f6e] Merge branch 'drm-radeon-sitn-support' of git://people.freedesktop.org/~airlied/linux
git bisect skip 09fa30226130652af75152d9010c603c66d46f6e
# good: [1b2681ba271c9f5bb66cb0d8ceeaa215fcd218d8] drm/radeon/kms: update duallink checks for DCE6
git bisect good 1b2681ba271c9f5bb66cb0d8ceeaa215fcd218d8
# skip: [59365671464539dc695bbf4d4bf37aabfd8604f2] drm/nouveau/i2c: fix thinko/regression on really old chipsets
git bisect skip 59365671464539dc695bbf4d4bf37aabfd8604f2
# good: [1c9c20f60230bd5a6195d41f9dd2dfa60874b1da] drm: remove the second argument of k[un]map_atomic()
git bisect good 1c9c20f60230bd5a6195d41f9dd2dfa60874b1da
# bad: [83b7f9ac9126f0532ca34c14e4f0582c565c6b0d] drm/i915: allow to select rc6 modes via kernel parameter
git bisect bad 83b7f9ac9126f0532ca34c14e4f0582c565c6b0d
# good: [1898f4426b3863216a9041389b34a3b995883027] Merge branch 'drm-nouveau-next' of git://git.freedesktop.org/git/nouveau/linux-2.6 into drm-next
git bisect good 1898f4426b3863216a9041389b34a3b995883027
# skip: [a1978f74da69565a2e472394c7dcb2cfb31b3e45] gma500: medfield: fix build without CONFIG_BACKLIGHT_CLASS_DEVICE
git bisect skip a1978f74da69565a2e472394c7dcb2cfb31b3e45
# good: [55a254ac63a3ac1867d1501030e7fba69c7d4aeb] drm/i915: properly restore the ppgtt page directory on resume
git bisect good 55a254ac63a3ac1867d1501030e7fba69c7d4aeb
# bad: [7dd4906586274f3945f2aeaaa5a33b451c3b4bba] drm/i915: Mark untiled BLT commands as fenced on gen2/3
git bisect bad 7dd4906586274f3945f2aeaaa5a33b451c3b4bba
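
A small sketch for summarizing a bisection log like the one above. It only looks at the "git bisect good/bad/skip <sha>" lines and reports the most recent commit marked bad; that is the culprit only if the bisection ran to completion, and with skipped commits git may narrow down to a range instead. The function name summarize_bisect_log is made up for illustration, and the sample is a short excerpt of the log, not the whole thing.

```python
import re

# Matches the verdict lines of a bisect log; comments and "start" are ignored.
BISECT_RE = re.compile(r"^git bisect (good|bad|skip) ([0-9a-f]{7,40})$")


def summarize_bisect_log(log_text):
    """Count verdicts and return the last commit that was marked bad."""
    counts = {"good": 0, "bad": 0, "skip": 0}
    last_bad = None
    for raw in log_text.splitlines():
        m = BISECT_RE.match(raw.strip())
        if m is None:
            continue
        verdict, sha = m.groups()
        counts[verdict] += 1
        if verdict == "bad":
            last_bad = sha
    return counts, last_bad


sample = """\
git bisect start '--' 'drivers/gpu/drm/'
# good: [c16fa4f2ad19908a47c63d8fa436a1178438c7e7] Linux 3.3
git bisect good c16fa4f2ad19908a47c63d8fa436a1178438c7e7
git bisect bad 0034102808e0dbbf3a2394b82b1bb40b5778de9e
git bisect skip 09fa30226130652af75152d9010c603c66d46f6e
git bisect bad 7dd4906586274f3945f2aeaaa5a33b451c3b4bba
"""

counts, last_bad = summarize_bisect_log(sample)
print(counts, last_bad)
```

To repeat the bisection itself rather than just summarize it, the saved log can be fed to "git bisect replay <logfile>".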

Comment 50 Daniel Vetter 2012-04-10 11:04:14 UTC

To clarify: If you revert 7dd4906586274f3945f2aeaaa5a33b451c3b4bba on top of 3.4-rc2, the resume regression is gone (but the tiling corruption is still there), but if you use plain 3.4-rc2, resume is broken?

Comment 51 Jiri Slaby 2012-04-10 11:07:09 UTC

(In reply to comment #50)
> To clarify: If you revert 7dd4906586274f3945f2aeaaa5a33b451c3b4bba on top of
> 3.4-rc2, the resume regression is gone (but the tiling corruption is still
> there), but if you use plain 3.4-rc2, resume is broken?

I don't know. I use -next tree from today. So:
3.4.0-rc2-next-20120410 -- broken resume
3.4.0-rc2-next-20120410 minus 7dd4906 -- working resume
3.4.0-rc2-next-20120410 minus 7dd4906 plus patch from here -- working resume

I haven't investigated the tiling corruption in any of the cases above.

Comment 52 Daniel Vetter 2012-04-10 11:08:59 UTC

That's even strange, because -next shouldn't contain any drm/i915 patches yet ...

Can you try to reproduce this on plain 3.4-rc2? I'm digging for a baseline, -next is way too volatile a target ...

Comment 53 Jiri Slaby 2012-04-10 11:13:56 UTC

(In reply to comment #52)
> That's even strange, because -next shouldn't contain any drm/i915 patches yet
> ...

But it contains 3.4-rc2 and more :).

Comment 54 Jiri Slaby 2012-04-10 12:54:42 UTC

(In reply to comment #51)
> (In reply to comment #50)
> > To clarify: If you revert 7dd4906586274f3945f2aeaaa5a33b451c3b4bba on top of
> > 3.4-rc2, the resume regression is gone (but the tiling corruption is still
> > there), but if you use plain 3.4-rc2, resume is broken?
> 
> I don't know. I use -next tree from today. So:
> 3.4.0-rc2-next-20120410 -- broken resume

BTW from the bisection log, you can see I started with 3.4-rc2 which does not work.
# bad: [0034102808e0dbbf3a2394b82b1bb40b5778de9e] Linux 3.4-rc2

> 3.4.0-rc2-next-20120410 minus 7dd4906 -- working resume
> 3.4.0-rc2-next-20120410 minus 7dd4906 plus patch from here -- working resume

What is worse, I have just bisected that the patch from here causes a ton of spurious interrupts. See https://lkml.org/lkml/2012/3/27/79

Comment 55 Chris Wilson 2012-04-10 13:28:35 UTC

The patch doesn't cause the interrupts itself, they already exist in the command stream but are masked until we need to wait to avoid the GPU hangs/corruption.

Comment 56 Daniel Vetter 2012-04-10 13:45:54 UTC

(In reply to comment #54)
> BTW from the bisection log, you can see I started with 3.4-rc2 which does not
> work.
> # bad: [0034102808e0dbbf3a2394b82b1bb40b5778de9e] Linux 3.4-rc2

Sorry, I've missed that. So reverting 7dd4906586274f3945f2aeaaa5a33b451c3b4bba on top of 3.4-rc2 (with no other patches applied) does fix resume for you again?

Comment 57 Jiri Slaby 2012-04-11 01:51:36 UTC

(In reply to comment #56)
> (In reply to comment #54)
> > BTW from the bisection log, you can see I started with 3.4-rc2 which does not
> > work.
> > # bad: [0034102808e0dbbf3a2394b82b1bb40b5778de9e] Linux 3.4-rc2
> 
> Sorry, I've missed that. So reverting 7dd4906586274f3945f2aeaaa5a33b451c3b4bba
> on top of 3.4-rc2 (with no other patches applied) does fix resume for you
> again?

Yes, exactly. (Except bad tiling should appear if I did not use SNA, but UXA. It's very hard to reproduce with SNA. [I haven't tried UXA.])

Comment 58 Chris Wilson 2012-04-11 02:09:45 UTC

Shotgun cleanup: http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=amalgam&id=f59160192f91f5719eae840816792e5372a81b61

Comment 59 Chris Wilson 2012-04-12 02:06:56 UTC

The shotgun was accurate. Kudos to Daniel for the clean fix though:

commit 15a13bbdffb0d6288a5dd04aee9736267da1335f
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Apr 12 01:27:57 2012 +0200

    drm/i915: clear fencing tracking state when retiring requests
    
    This fixes a resume regression introduced in
    
    commit 7dd4906586274f3945f2aeaaa5a33b451c3b4bba
    Author: Chris Wilson <chris@chris-wilson.co.uk>
    Date:   Wed Mar 21 10:48:18 2012 +0000
    
        drm/i915: Mark untiled BLT commands as fenced on gen2/3
    
    which fixed fencing tracking for untiled blt commands.
    
    A side effect of that patch was that now also untiled objects have a
    non-zero obj->last_fenced_seqno to track when a fence can be set up
    after a pipelined tiling change. Unfortunately this was only cleared
    by the fence setup and teardown code, resulting in tons of untiled but
    inactive objects with non-zero last_fenced_seqno.
    
    Now after resume we completely reset the seqno tracking, both on the
    driver side (by setting dev_priv->next_seqno = 1) and on the hw side
    (by allocating a new hws page, which contains the seqnos). Hilarity
    and indefinite waits ensued from the stale seqnos in
    obj->last_fenced_seqno from before the suspend.
    
    The fix is to properly clear the fencing tracking state like we
    already do for the normal gpu rendering while moving objects off the
    active list.
    
    Reported-and-tested-by: "Rafael J. Wysocki" <rjw@sisk.pl>
    Cc: Jiri Slaby <jslaby@suse.cz>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-Off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
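
The indefinite wait described in the commit message can be illustrated with a toy model. It assumes sequence numbers are compared with the usual wrap-safe signed 32-bit difference and that a last_fenced_seqno of 0 means "nothing to wait for"; both are assumptions for this sketch, the function names are hypothetical, and the stale value is just an example number.

```python
def to_int32(value):
    """Interpret an integer as a signed 32-bit quantity."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def seqno_passed(current, target):
    """Wrap-safe test: has 'current' reached or passed 'target'?"""
    return to_int32(current - target) >= 0


def wait_needed(last_fenced_seqno, current_seqno):
    # 0 is treated as "no fenced access outstanding" in this model.
    return last_fenced_seqno != 0 and not seqno_passed(current_seqno,
                                                       last_fenced_seqno)


stale_last_fenced_seqno = 389582  # example value left over from before suspend
seqno_after_resume = 1            # tracking restarts at 1 on resume

# Stale value: the object appears to wait for a request far in the future.
stuck = wait_needed(stale_last_fenced_seqno, seqno_after_resume)

# Cleared when the object was retired: nothing to wait for after resume.
fixed = wait_needed(0, seqno_after_resume)

print(stuck, fixed)
```

With the stale value the wait condition stays true until several hundred thousand new requests have been issued, which in practice looks like a hang on resume; once the value is cleared at retire time the condition is false immediately.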
