Bug 68718 - [snb] vsync hang
Summary: [snb] vsync hang
Status: NEEDINFO
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: unspecified
Hardware: All Linux (All)
Importance: medium normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 74411
Depends on:
Blocks:
 
Reported: 2013-08-29 17:44 UTC by Ilia Mirkin
Modified: 2014-12-10 16:10 UTC

See Also:
i915 platform:
i915 features:


Attachments
/sys/kernel/debug/dri/0/i915_error_state (2.00 MB, text/plain)
2013-08-29 17:44 UTC, Ilia Mirkin
/sys/class/drm/card0/error (2.02 MB, text/plain)
2014-02-09 06:42 UTC, Ilia Mirkin
rc6=0 /sys/class/drm/card0/error (2.01 MB, text/plain)
2014-02-09 06:58 UTC, Ilia Mirkin
/sys/class/drm/card0/error rc9=0 (301.32 KB, text/plain)
2014-02-09 12:54 UTC, Martin Jørgensen
Serialise DERRMR write (2.38 KB, patch)
2014-02-11 11:48 UTC, Chris Wilson
chrome-hang-2014-02-11-racer0 (2.02 MB, text/plain)
2014-02-11 16:09 UTC, Ilia Mirkin

Description Ilia Mirkin 2013-08-29 17:44:40 UTC
Created attachment 84873 [details]
/sys/kernel/debug/dri/0/i915_error_state

This started happening with semi-recent software (mesa 9.1, xf86-video-intel 2.20, kernel 3.9.7) when I added an HDMI screen to my setup, placed above LVDS1. (HDMI screen is 1920x1200, LVDS screen is 1600x900). Whenever I go into compose ("new" version) in gmail, the screen hangs (although I can still move the cursor), and I get

[  952.397851] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[  952.397861] [drm:kick_ring] *ERROR* Kicking stuck wait on render ring

in dmesg.

I upgraded to 3.10.7, mesa 9.2.0, xf86-video-intel 2.21.15, and this hasn't had any impact on the issue. Attached is an error state file I captured earlier on. Note that running chromium with LIBGL_ALWAYS_SOFTWARE=1 fixes the problem.

Hardware:
00:02.0 VGA compatible controller [0300]: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller [8086:0126] (rev 09)
Thinkpad T420s laptop, no optimus.

[Come to think of it, I do remember this happening before I added the HDMI screen, but much more rarely, and not at all reproducibly.]

This has happened both with chromium 28 and 29. Works fine in firefox.
Comment 1 Ilia Mirkin 2013-08-30 18:30:38 UTC
I haven't tested all the variations, but it seems like the second HDMI screen is probably a red herring. However, whether the problem occurs depends on the window size (and the HDMI screen is bigger). Below 900px in height the problem almost never happens; above 1000px it almost always happens, irrespective of the screen the window is on. [Of course this never came up before since the LVDS panel is only 1600x900.]
Comment 2 Shuhao 2013-10-19 23:52:20 UTC
Is this a duplicate of https://bugs.freedesktop.org/show_bug.cgi?id=70151 ?
Comment 3 Kenneth Graunke 2014-02-06 07:56:44 UTC
This is an SNA batch, hanging on an MI_WAIT_FOR_EVENT after whacking some registers.  Reassigning to xf86-video-intel and Chris Wilson.
Comment 4 Ilia Mirkin 2014-02-06 07:58:33 UTC
Before the inevitable suggestion from Chris, I'll be sure to re-test this with the latest xf86-video-intel driver (as well as latest mesa-git for good measure) and grab a fresh error state if the issue recurs.
Comment 5 Chris Wilson 2014-02-06 08:06:00 UTC
Might as well wait for a fresh error state from current stuff before crying. Ultimately I think this might need i915.enable_rc6=0. :(
Comment 6 Chris Wilson 2014-02-08 18:30:57 UTC
*** Bug 74411 has been marked as a duplicate of this bug. ***
Comment 7 Chris Wilson 2014-02-08 19:01:00 UTC
Can either of you confirm that running with i915.enable_rc6=0 (and cat /sys/kernel/debug/dri/0/i915_drpc_info to confirm that it is disabled) prevents the hang?
Comment 8 Ilia Mirkin 2014-02-09 06:42:30 UTC
Created attachment 93691 [details]
/sys/class/drm/card0/error

Well, with the latest mesa (and 3.13-rc7, and chrome 32), I can't get gmail to hang anymore by going into the compose screen. But not to be defeated, I loaded up google maps, and as part of loading there were ~3 hangs until I managed to close the tab. (I didn't have to do anything fancy, just load it... this is the "maps preview" thing with google earth basically built into it.)

These were the relevant bits from the log:

[473527.324523] [drm:ring_stuck] *ERROR* Kicking stuck wait on render ring
[473527.324531] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[473527.324534] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[473527.324535] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[473527.324537] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[473527.324538] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[473534.318239] [drm:ring_stuck] *ERROR* Kicking stuck wait on render ring
[473541.315958] [drm:ring_stuck] *ERROR* Kicking stuck wait on render ring

Not sure this is the same issue as the one from before -- if it's not, happy to open another bug. I will go and check whether disabling rc6 helps now (although that would be a workaround, not a solution).
Comment 9 Ilia Mirkin 2014-02-09 06:58:38 UTC
Created attachment 93692 [details]
rc6=0 /sys/class/drm/card0/error

OK, disabling rc6, google maps still locks up. (i.e. https://www.google.com/maps/preview)

I double-checked that rc6 was disabled:

# cat /sys/kernel/debug/dri/0/i915_drpc_info
RC information accurate: yes
Video Turbo Mode: yes
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: no
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: on
Core Power Down: no
RC6 "Locked to RPn" residency since boot: 0
RC6 residency since boot: 5356677
RC6+ residency since boot: 0
RC6++ residency since boot: 1115939
RC6   voltage: 450mV
RC6+  voltage: 245mV
RC6++ voltage: 245mV

In case the question of versions comes up, I'm at mesa commit 356aff3. DDX is at 1cbc59a. Chromium 32.0.1700.77
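
A minimal sketch of how the "double-check that rc6 is disabled" step could be automated. It assumes the i915_drpc_info text has the "label: value" layout pasted above; the function names and the SAMPLE string are hypothetical and only illustrate the parsing, they are not part of any Intel tool.

```python
def rc6_states(drpc_info: str) -> dict:
    """Map every '... RC6 Enabled' label to True/False.

    Assumes the 'label: value' layout shown in the pasted
    i915_drpc_info output; other lines are ignored.
    """
    states = {}
    for line in drpc_info.splitlines():
        label, sep, value = line.partition(":")
        label = label.strip()
        if sep and label.endswith("RC6 Enabled"):
            states[label] = value.strip() == "yes"
    return states


def rc6_fully_disabled(drpc_info: str) -> bool:
    """True only if RC6 lines were found and none of them says 'yes'."""
    states = rc6_states(drpc_info)
    return bool(states) and not any(states.values())


# Excerpt of the output pasted in this comment.
SAMPLE = """\
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: no
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: on
"""
```

Calling rc6_fully_disabled() on the contents of /sys/kernel/debug/dri/0/i915_drpc_info (read as root) would then give a yes/no answer without eyeballing the three lines.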
Comment 10 Chris Wilson 2014-02-09 12:21:45 UTC
Cool, that's important as for Ivybridge it is documented that the GPU must be kept awake across the vsync, so it does not look like that is required here.

Which leaves the open question whether any Sandybridges sold did not have the vsync feature (it was only added very late in the development cycle), or if there is a w/a that I missed.
Comment 11 Martin Jørgensen 2014-02-09 12:54:05 UTC
happened again:
[drm:ring_stuck] *ERROR* Kicking stuck wait on render ring

cat /sys/kernel/debug/dri/0/i915_drpc_info :
RC information accurate: yes
Video Turbo Mode: yes
HW control enabled: yes
SW control enabled: no
RC1e Enabled: no
RC6 Enabled: no
Deep RC6 Enabled: no
Deepest RC6 Enabled: no
Current RC state: on
Core Power Down: no
RC6 "Locked to RPn" residency since boot: 0
RC6 residency since boot: 12259440
RC6+ residency since boot: 0
RC6++ residency since boot: 1072070
RC6   voltage: 450mV
RC6+  voltage: 245mV
RC6++ voltage: 245mV

attaching /sys/class/drm/card0/error
Comment 12 Martin Jørgensen 2014-02-09 12:54:40 UTC
Created attachment 93707 [details]
/sys/class/drm/card0/error rc9=0
Comment 13 Ilia Mirkin 2014-02-09 22:53:08 UTC
(In reply to comment #10)
> Cool, that's important as for Ivybridge it is documented that the GPU must
> be kept awake across the vsync, so it does not look like that is required
> here.
> 
> Which leaves the open question whether any Sandybridges sold did not have
> the vsync feature (it was only added very late in the development cycle), or
> if there is a w/a that I missed.

I'm not really sure what you meant by your comment... Do the new traces show the same vsync error, or are they due to some other thing that maps triggers? (In the latter case should I file a fresh bug?)

As for SNB revision, not sure if this is enough to identify which one I have, but here's some info:

00:02.0 VGA compatible controller [0300]: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller [8086:0126] (rev 09) (prog-if 00 [VGA controller])
        Subsystem: Lenovo Device [17aa:21d2]

vendor_id       : GenuineIntel
cpu family      : 6
model           : 42
model name      : Intel(R) Core(TM) i5-2540M CPU @ 2.60GHz
stepping        : 7
microcode       : 0x28
Comment 14 Daniel Vetter 2014-02-11 10:47:53 UTC
On Sun, Feb 9, 2014 at 11:53 PM,  <bugzilla-daemon@freedesktop.org> wrote:
>
> I'm not really sure what you meant by your comment... Do the new traces show
> the same vsync error, or are they due to some other thing that maps triggers?
> (In the latter case should I file a fresh bug?)

If it says "Kicking stuck wait on render ring" it's the same bug. But
otherwise there's a fair chance that gmaps can hang the render ring
due to some known mesa/hw contexts/hiz issues (suspected cause at
least) on snb. But with latest mesa those hangs should be really rare.

If you hit them though please file a new bug report.
-Daniel
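
The rule of thumb above ("Kicking stuck wait on render ring" means it is this bug) can be sketched as a small log filter. The regular expression is an assumption based only on the dmesg lines quoted in this report (both the kick_ring and ring_stuck variants); the helper name is hypothetical.

```python
import re

# Matches e.g. "[drm:ring_stuck] *ERROR* Kicking stuck wait on render ring"
# and the older "[drm:kick_ring] ..." form seen in the description.
STUCK_WAIT = re.compile(
    r"\[drm:(?P<func>\w+)\] \*ERROR\* Kicking stuck wait on (?P<ring>\w+) ring"
)


def stuck_wait_rings(dmesg: str) -> list:
    """Return the ring name for every 'Kicking stuck wait' line found."""
    return [m.group("ring") for m in STUCK_WAIT.finditer(dmesg)]


# Lines copied from this report; the last one is the other kind of hang.
LOG = """\
[  952.397861] [drm:kick_ring] *ERROR* Kicking stuck wait on render ring
[473527.324523] [drm:ring_stuck] *ERROR* Kicking stuck wait on render ring
[117822.524434] [drm] no progress on render ring
"""
```

A non-empty result from stuck_wait_rings() points at the vsync wait discussed here; hangs that only show other messages would go into a new report, as suggested above.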
Comment 15 Ilia Mirkin 2014-02-11 10:51:28 UTC
(In reply to comment #14)
> On Sun, Feb 9, 2014 at 11:53 PM,  <bugzilla-daemon@freedesktop.org> wrote:
> >
> > I'm not really sure what you meant by your comment... Do the new traces show
> > the same vsync error, or are they due to some other thing that maps triggers?
> > (In the latter case should I file a fresh bug?)
> 
> If it says "Kicking stuck wait on render ring" it's the same bug. But

See comment #8 -- that's exactly what it said. I'm _pretty_ sure that's what it also said for when I had enable_rc6=0, I can double-check if it's not apparent from the error state I uploaded.
Comment 16 Chris Wilson 2014-02-11 11:48:25 UTC
Created attachment 93855 [details] [review]
Serialise DERRMR write

Sorry, I thought I was being clear when I acknowledged the hang still occurred with rc6, and so we need an explanation that doesn't involve rc6.

Attached is a patch to serialise the write to DERRMR prior to us loading the scanline window and waiting upon the result. We've tried this in the past but it didn't seem to make any improvement (though iirc that was in connection with rc6 hangs + vsync).
Comment 17 Ilia Mirkin 2014-02-11 15:18:29 UTC
(In reply to comment #16)
> Created attachment 93855 [details] [review] [review]
> Serialise DERRMR write
> 
> Sorry, I thought I was being clear when I acknowledged the hang still
> occurred with rc6, and so we need an explanation that doesn't involve rc6.
> 
> Attached is a patch to serialise the write to DERRMR prior to us loading the
> scanline window and waiting upon the result. We've tried this in the past
> but it didn't seem to make any improvement (though iirc that was in
> connection with rc6 hangs + vsync).

Much harder to reproduce with this patch. maps didn't seem to do it anymore. However I loaded up http://helloracer.com/racer-s/, and after a little while (30-60s), it hung:

[112245.319716] [drm:ring_stuck] *ERROR* Kicking stuck wait on render ring

Let me know if you'd like the error file. And of course after having closed the racer down, loading google maps also caused the hang. (Hm, now maps is freezing a _lot_ again...)
Comment 18 Chris Wilson 2014-02-11 16:07:11 UTC
(In reply to comment #17)
> Let me know if you'd like the error file. And of course after having closed
> the racer down, loading google maps also caused the hang. (Hm, now maps is
> freezing a _lot_ again...)

Please, that would be useful to sanity check the instructions written into the ring. And just maybe there will be a different explanation for this hang...
Comment 19 Ilia Mirkin 2014-02-11 16:09:15 UTC
Created attachment 93873 [details]
chrome-hang-2014-02-11-racer0

Hang that happened when on http://helloracer.com/racer-s/ (which is a pretty well-done demo btw)
Comment 20 Chris Wilson 2014-02-11 16:20:41 UTC
Guessing now, but maybe try this delta patch on top:

diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
index 99afbd1..2620669 100644
--- a/src/sna/sna_display.c
+++ b/src/sna/sna_display.c
@@ -3984,7 +3984,7 @@ static bool sna_emit_wait_for_scanline_gen6(struct sna *sna,
        b = kgem_get_batch(&sna->kgem);
 
        /* Both the LRI and WAIT_FOR_EVENT must be in the same cacheline */
-       if (((sna->kgem.nbatch + 6) >> 4) != (sna->kgem.nbatch + 9) >> 4) {
+       if (((sna->kgem.nbatch + 6) >> 3) != (sna->kgem.nbatch + 10) >> 3) {
                int dw = sna->kgem.nbatch + 6;
                dw = ALIGN(dw, 16) - dw;
                while (dw--)


I think the cacheline it mentions is 64 bytes, but we may as well check against 32 bytes just in case.
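
To make the arithmetic in the delta patch easier to follow, here is a small Python model of the check. It assumes sna->kgem.nbatch counts 32-bit dwords (so ">> 4" groups 16 dwords = 64 bytes and ">> 3" groups 8 dwords = 32 bytes); the function names are hypothetical and this is only a model of the expressions in the diff, not the driver code.

```python
DWORD_BYTES = 4  # assumption: kgem.nbatch indexes 32-bit dwords


def block_bytes(shift: int) -> int:
    """Size in bytes of the block that '>> shift' compares against."""
    return (1 << shift) * DWORD_BYTES


def straddles(nbatch: int, first: int, last: int, shift: int) -> bool:
    """Model of ((nbatch + first) >> shift) != ((nbatch + last) >> shift):
    True when the two dwords land in different blocks."""
    return ((nbatch + first) >> shift) != ((nbatch + last) >> shift)


def noop_padding(nbatch: int, first: int = 6, align: int = 16) -> int:
    """Model of 'dw = ALIGN(dw, 16) - dw': padding dwords needed so that
    dword nbatch + first starts on an align-dword boundary."""
    dw = nbatch + first
    return ((dw + align - 1) & ~(align - 1)) - dw
```

For example, with nbatch = 8 the original test compares dwords 14 and 17, which sit in different 64-byte blocks, so two padding dwords are emitted; after padding (nbatch = 10) both variants of the test see the commands inside one block.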
Comment 21 Ilia Mirkin 2014-02-11 16:30:35 UTC
(In reply to comment #20)
> Guessing now, but maybe try this delta patch on top:
> 
> diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
> index 99afbd1..2620669 100644
> --- a/src/sna/sna_display.c
> +++ b/src/sna/sna_display.c
> @@ -3984,7 +3984,7 @@ static bool sna_emit_wait_for_scanline_gen6(struct sna
> *sna,
>         b = kgem_get_batch(&sna->kgem);
>  
>         /* Both the LRI and WAIT_FOR_EVENT must be in the same cacheline */
> -       if (((sna->kgem.nbatch + 6) >> 4) != (sna->kgem.nbatch + 9) >> 4) {
> +       if (((sna->kgem.nbatch + 6) >> 3) != (sna->kgem.nbatch + 10) >> 3) {

Is the +9 -> +10 on purpose? That doesn't seem cache-line-equality-related...

>                 int dw = sna->kgem.nbatch + 6;
>                 dw = ALIGN(dw, 16) - dw;
>                 while (dw--)
> 
> 
> I think the cacheline it mentions is 64 bytes, but we may as well check against
> 32 bytes just in case.
Comment 22 Chris Wilson 2014-02-11 16:33:58 UTC
Two random changes boiled into one. I was also thinking that maybe I needed to make sure the entire two commands fitted into the cacheline.
Comment 23 Ilia Mirkin 2014-02-11 16:54:59 UTC
(In reply to comment #20)
> Guessing now, but maybe try this delta patch on top:
> 
> diff --git a/src/sna/sna_display.c b/src/sna/sna_display.c
> index 99afbd1..2620669 100644
> --- a/src/sna/sna_display.c
> +++ b/src/sna/sna_display.c
> @@ -3984,7 +3984,7 @@ static bool sna_emit_wait_for_scanline_gen6(struct sna
> *sna,
>         b = kgem_get_batch(&sna->kgem);
>  
>         /* Both the LRI and WAIT_FOR_EVENT must be in the same cacheline */
> -       if (((sna->kgem.nbatch + 6) >> 4) != (sna->kgem.nbatch + 9) >> 4) {
> +       if (((sna->kgem.nbatch + 6) >> 3) != (sna->kgem.nbatch + 10) >> 3) {

I also tried leaving this at >> 4 (but with the +10)

>                 int dw = sna->kgem.nbatch + 6;
>                 dw = ALIGN(dw, 16) - dw;

And I tried making this ALIGN(..., 8) with the >> 3 shift

>                 while (dw--)
> 
> 
> I think the cacheline it mentions is 64 bytes, but we may as well check against
> 32 bytes just in case.

In all the cases, I was able to reproduce

[112535.225840] [drm:ring_stuck] *ERROR* Kicking stuck wait on render ring

Every so often I also got

[117822.524434] [drm] no progress on render ring
[117822.524482] [drm:i915_set_reset_status] *ERROR* render ring hung flushing bo (0x60e000 ctx 0) at 0x1de05ce4

But my understanding is that this is unrelated to the issue at hand.
Comment 24 Ilia Mirkin 2014-02-11 19:03:34 UTC
I read a bunch of documentation, I'm sure you already know all this, but here is what I found:

About the scanline thing (DE_LOAD_SL, aka 0x4f100):

"""
Notes for command streamer programming to use this display load scan lines register: 

Either MMIO or a MI_LOAD_REGISTER_IMM command can be used to unmask the scan line render response 0x44050. That can be done any time before programming this register. 

In order to use MI_WAIT_FOR_EVENT on scan line window, DE_LOAD_SL must be programmed using MI_LOAD_REGISTER_IMM immediately prior to the MI_WAIT_FOR_EVENT on scan line window, both commands must be in the same cacheline, both commands must be executed using the same tail or batch update, and if sync flush is enabled, MI_SUSPEND_FLUSH must be used to suspend flushes prior to the commands.
"""

Also this in MI_WAIT_FOR_EVENT:

"""
Software must disable MI_WAIT_FOR_EVENT RC6 entry via RC_PSMI_CTRL if MI_WAIT_FOR_EVENT is parsed in a batch buffer with the following attributes set: 

* batch buffer in PPGTT space (labeled “non-secure” in command)
* CB^2 batch buffer 

MI_NOOP setting NOP register (or any other benign command) must be set after MI_WAIT_FOR_EVENT under the following conditions 

* Back-to-back MI_WAIT_FOR_EVENT commands
* MI_WAIT_FOR_EVENT is the last command before head = tail
"""
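
The MI_NOOP rule quoted above can be expressed as a small checker. This is only a sketch: a batch is modelled as a list of command names, and "last command before head = tail" is approximated as "last entry in the list", which is an assumption of this model rather than something the documentation states.

```python
WAIT = "MI_WAIT_FOR_EVENT"


def waits_missing_noop(commands: list) -> list:
    """Return indices of MI_WAIT_FOR_EVENT entries that break the quoted
    rule: a wait directly followed by another wait, or a wait that is the
    final command, needs a benign command (e.g. MI_NOOP) after it."""
    bad = []
    for i, cmd in enumerate(commands):
        if cmd != WAIT:
            continue
        is_last = i == len(commands) - 1
        back_to_back = not is_last and commands[i + 1] == WAIT
        if is_last or back_to_back:
            bad.append(i)
    return bad
```

A stream such as ["MI_LOAD_REGISTER_IMM", "MI_WAIT_FOR_EVENT"] would be flagged at index 1, while appending "MI_NOOP" clears it.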

Lastly, it looks like there's a PIPEA_SLC (0x70004) which I guess needs to be programmed on [DevSNB:D2], whatever that is (I'm hoping "pre-release hw").

No idea if these apply, and you were probably already aware of these things, but thought I'd mention anyways. (Is MI_SUSPEND_FLUSH used? Can the commands get broken up into different batch buffers behind the scenes? What's CB^2?)

Lastly, is the masking logic for DERRMR correct? It uses ~event... I couldn't find the spec for what the various bits actually mean, I guess they're meant to line up with the MI_WAIT_FOR_EVENT bits? Would be good to double-check... (Also, why do you need to store the DERRMR value somewhere? Seems odd to me, but what do I know.)
Comment 25 Chris Wilson 2014-02-11 20:29:22 UTC
(In reply to comment #24)
> Lastly, it looks like there's a PIPEA_SLC (0x70004) which I guess needs to
> be programmed on [DevSNB:D2], whatever that is (I'm hoping "pre-release hw").

Right, (SLC = scanline counter) which is programmed by setting DE_LOAD_SL. The tricky part is that DE_LOAD_SL only exists for later chips - some early production chips do not have the register and so have no way to program PIPE*_SLC.

In fact, can you do intel_reg_read 0x4f100? Or intel_reg_write 0x4f100 0xdeadbeef; intel_reg_read 0x4f100

> No idea if these apply, and you were probably already aware of these things,
> but thought I'd mention anyways. (Is MI_SUSPEND_FLUSH used? Can the commands
> get broken up into different batch buffers behind the scenes? What's CB^2?)

No, those restrictions do not apply as we use none of suspend flush, a non-secure batch buffer for LRI, ppgtt batches or chained batchbuffers (CB^2).

> Lastly, is the masking logic for DERRMR correct? It uses ~event... I
> couldn't find the spec for what the various bits actually mean, I guess
> they're meant to line up with the MI_WAIT_FOR_EVENT bits?

It is just a happy coincidence on SNB that the bits do line up between DERRMR and the WAIT_FOR_EVENT. Yes, I have checked those many times.

> Would be good to
> double-check... (Also, why do you need to store the DERRMR value somewhere?
> Seems odd to me, but what do I know.)

Because the hardware is broken, and this is one of the w/a. I don't think it is required because the wait will ensure that the flush will happen before we write to the same register again, hence why I left it out of the original code. But it was something easy enough to test. Again.
Comment 26 Ilia Mirkin 2014-02-11 20:45:27 UTC
(In reply to comment #25)
> (In reply to comment #24)
> > Lastly, it looks like there's a PIPEA_SLC (0x70004) which I guess needs to
> > be programmed on [DevSNB:D2], whatever that is (I'm hoping "pre-release hw").
> 
> Right, (SLC = scanline counter) which is programmed by setting DE_LOAD_SL.
> The tricky part is that DE_LOAD_SL only exists for later chips - some early
> production chips do not have the register and so have no way to program
> PIPE*_SLC.

My read was that they did -- you had to write the bottom bits of PIPE*_SLC. And on the chips with DE_LOAD_SL, those bottom bits flip to reserved. I'm not too familiar with the Intel PRM notation though, so I could easily have misunderstood.

> 
> In fact, can you do intel_reg_read 0x4f100? Or intel_reg_write 0x4f100
> 0xdeadbeef; intel_reg_read 0x4f100

intel-gpu-tools # ./tools/intel_reg_write 0x4f100 0xdeadbeef
Value before: 0x1000100
Value after: 0xDEADBEEF
intel-gpu-tools # ./tools/intel_reg_read 0x4f100
0x4F100 : 0xDEADBEEF
(wait a little while)
intel-gpu-tools # ./tools/intel_reg_read 0x4f100
0x4F100 : 0x1000100

> > Lastly, is the masking logic for DERRMR correct? It uses ~event... I
> > couldn't find the spec for what the various bits actually mean, I guess
> > they're meant to line up with the MI_WAIT_FOR_EVENT bits?
> 
> It is just a happy coincidence on SNB that the bits do line up between
> DERRMR and the WAIT_FOR_EVENT. Yes, I have checked those many times.

I'm sure you're right, but could you point out where those are specified? They weren't next to DERRMR. BTW, may be worth pointing out that right now I have 2 screens connected, which I guess means both pipes get used?

> 
> > Would be good to
> > double-check... (Also, why do you need to store the DERRMR value somewhere?
> > Seems odd to me, but what do I know.)
> 
> Because the hardware is broken, and this is one of the w/a. I don't think it
> is required because the wait will ensure that the flush will happen before
> we write to the same register again, hence why I left it out of the original
> code. But it was something easy enough to test. Again.

Good reason for me to be confused :)

Another oddity:

	/* Always program one less than the desired value */
	if (--y1 < 0)
		y1 = crtc->bounds.y2;

...

	if (y2 == y1)
		return false;

Should that perhaps be if (y2 < y1)?

And one last thought: Should the DE_LOAD_SL stuff still be done if full_height == true? Then we're just waiting for the vsync...
Comment 27 Chris Wilson 2014-02-11 20:57:05 UTC
(In reply to comment #26)
> (In reply to comment #25)
> > (In reply to comment #24)
> > > Lastly, it looks like there's a PIPEA_SLC (0x70004) which I guess needs to
> > > be programmed on [DevSNB:D2], whatever that is (I'm hoping "pre-release hw").
> > 
> > Right, (SLC = scanline counter) which is programmed by setting DE_LOAD_SL.
> > The tricky part is that DE_LOAD_SL only exists for later chips - some early
> > production chips do not have the register and so have no way to program
> > PIPE*_SLC.
> 
> My read was that they did -- you had to write the bottom bits of PIPE*_SLC.
> And on the chips with DE_LOAD_SL, those bottom bits flip to reserved. I'm
> not too familiar with the Intel PRM notation though, so I could easily have
> misunderstood.

But you can't actually program an inclusive window into SLC -- how the memories are coming back to me.

> > In fact, can you do intel_reg_read 0x4f100? Or intel_reg_write 0x4f100
> > 0xdeadbeef; intel_reg_read 0x4f100
> 
> intel-gpu-tools # ./tools/intel_reg_write 0x4f100 0xdeadbeef
> Value before: 0x1000100
> Value after: 0xDEADBEEF
> intel-gpu-tools # ./tools/intel_reg_read 0x4f100
> 0x4F100 : 0xDEADBEEF
> (wait a little while)
> intel-gpu-tools # ./tools/intel_reg_read 0x4f100
> 0x4F100 : 0x1000100

Bleh, so either it's a working chip or this method of detection for nonworking chips is inadequate. :(
 
> > > Lastly, is the masking logic for DERRMR correct? It uses ~event... I
> > > couldn't find the spec for what the various bits actually mean, I guess
> > > they're meant to line up with the MI_WAIT_FOR_EVENT bits?
> > 
> > It is just a happy coincidence on SNB that the bits do line up between
> > DERRMR and the WAIT_FOR_EVENT. Yes, I have checked those many times.
> 
> I'm sure you're right, but could you point out where those are specified?
> They weren't next to DERRMR. BTW, may be worth pointing out that right now I
> have 2 screens connected, which I guess means both pipes get used?

Scan through for a block titled
"Display Engine Render Response Message Bit Definition"


> > 
> > > Would be good to
> > > double-check... (Also, why do you need to store the DERRMR value somewhere?
> > > Seems odd to me, but what do I know.)
> > 
> > Because the hardware is broken, and this is one of the w/a. I don't think it
> > is required because the wait will ensure that the flush will happen before
> > we write to the same register again, hence why I left it out of the original
> > code. But it was something easy enough to test. Again.
> 
> Good reason for me to be confused :)
> 
> Another oddity:
> 
> 	/* Always program one less than the desired value */
> 	if (--y1 < 0)
> 		y1 = crtc->bounds.y2;
> 
> ...
> 
> 	if (y2 == y1)
> 		return false;
> 
> Should that perhaps be if (y2 < y1)?

It's actually legal for y2 to be less than y1 - e.g. any window that starts on the top line.
 
> And one last thought: Should the DE_LOAD_SL stuff still be done if
> full_height == true? Then we're just waiting for the vsync...

You could skip the LRI in that case - it is just simpler to always emit it.
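
A small model of the two quoted fragments may help show why y2 < y1 is legal. It assumes the wrap value is whatever crtc->bounds.y2 holds (passed in here as bounds_y2) and it does not model the code elided by "..." in the quote; the function name is hypothetical.

```python
def scanline_window(y1: int, y2: int, bounds_y2: int):
    """Model of the quoted logic: program one less than the desired start
    line, wrapping around to bounds_y2 when the window starts on line 0.
    Returns None where the quoted code returns false (y2 == y1)."""
    y1 -= 1
    if y1 < 0:
        y1 = bounds_y2
    if y2 == y1:
        return None
    return (y1, y2)
```

With a window starting on the top line, e.g. scanline_window(0, 300, 899), the programmed start wraps to 899 and so ends up greater than y2, which is why a "y2 < y1" rejection would be wrong.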
Comment 28 Ilia Mirkin 2014-02-11 21:49:07 UTC
(In reply to comment #27)
> (In reply to comment #26)
> > Another oddity:
> > 
> > 	/* Always program one less than the desired value */
> > 	if (--y1 < 0)
> > 		y1 = crtc->bounds.y2;
> > 
> > ...
> > 
> > 	if (y2 == y1)
> > 		return false;
> > 
> > Should that perhaps be if (y2 < y1)?
> 
> It's actually legal for y2 to be less than y1 - e.g. any window that starts
> on the top line.

Oh, because it does it in a circular fashion, so 1 less than 0 is back to the bottom?

One last question... what makes you think it's this code that's causing issues? i.e. what do I look at in the hang to work that out (so that I can go and try 100 diff things and make sure that this is the place where things are hanging).

Also -- I don't know what kind of resources you have available, but I don't exactly have a "rare" hardware setup. Lenovo T420s laptop, internal LVDS panel, and either a VGA or HDMI-connected 1920x1200 screen depending on where I am (and I'm pretty sure the second screen doesn't actually play a part here). 

Load up chrome and go to google maps (I think the reason I was having trouble getting it to hang earlier was that they had for some reason temporarily turned off Earth mode... now that it's back, it hangs on load again) or that webgl demo.

Is the suggestion that this is hard to reproduce and I'm in a (relatively) unique configuration/setup/etc? Or are lots of people seeing this, which makes it no easier to figure out what's going on?
Comment 29 Chris Wilson 2014-02-11 22:20:31 UTC
(In reply to comment #28)
> (In reply to comment #27)
> Oh, because it does it in a circular fashion, so 1 less than 0 is back to
> the bottom?

Right.

> One last question... what makes you think it's this code that's causing
> issues? i.e. what do I look at in the hang to work that out (so that I can
> go and try 100 diff things and make sure that this is the place where things
> are hanging).

In the GPU dump, ACTHD points to the address that the command streamer is at - usually the dword after what is being executed. IPEHR holds the command currently being executed. Both indicate that it is the wait-for-event command that is waiting indefinitely, as does the ring-wait flag inside the RING_CTL register. As this is the only piece of code that tries to program vsync waits, we can be reasonably sure that it is the culprit. Then it is a guessing game as to what hardware state is incorrect, or what programming went wrong. To aid those guesses, we dump a number of other, hopefully relevant, registers.
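
As a rough illustration of the IPEHR check described above, the following sketch decodes a command dword and tests whether it is MI_WAIT_FOR_EVENT. The bit layout (client in bits 31:29, MI opcode in bits 28:23) and the opcode values are taken from the i915 kernel headers as best recalled, so treat them as assumptions to verify; the helper names are hypothetical.

```python
# Assumed encodings (cf. MI_INSTR(opcode, flags) in the i915 headers).
MI_CLIENT = 0x0
MI_NOOP = 0x00
MI_WAIT_FOR_EVENT = 0x03
MI_LOAD_REGISTER_IMM = 0x22


def mi_opcode(dword: int):
    """Return the MI opcode (bits 28:23), or None if the client field
    (bits 31:29) says this is not an MI command."""
    if ((dword >> 29) & 0x7) != MI_CLIENT:
        return None
    return (dword >> 23) & 0x3F


def is_wait_for_event(ipehr: int) -> bool:
    """True if the IPEHR value from an error state decodes to a wait."""
    return mi_opcode(ipehr) == MI_WAIT_FOR_EVENT
```

Feeding the IPEHR value from /sys/class/drm/card0/error into is_wait_for_event() gives a quick first answer to "is this the vsync wait?", before looking at ACTHD and RING_CTL.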


> Also -- I don't know what kind of resources you have available, but I don't
> exactly have a "rare" hardware setup. Lenovo T420s laptop, internal LVDS
> panel, and either a VGA or HDMI-connected 1920x1200 screen depending on
> where I am (and I'm pretty sure the second screen doesn't actually play a
> part here). 
> 
> Load up chrome and go to google maps (I think the reason I was having
> trouble getting it to hang earlier was that they had for some reason
> temporarily turned off Earth mode... now that it's back, it hangs on load
> again) or that webgl demo.
> 
> Is the suggestion that this is hard to reproduce and I'm in a (relatively)
> unique configuration/setup/etc? Or are lots of people seeing this, which
> makes it no easier to figure out what's going on?

Users of vsync waits are in the minority, as anybody with a compositor will hit a different path. Even among those, I have seen more reports of complete system lockups that seem to be rc6 in conjunction with vsync than reports of this GPU hang. So it may be just that the hang is rare, or that it requires a particular display/window configuration, or that it requires particular hardware. (As an anecdote, I have been using vsync on SNB with a single eDP panel for a couple of years without a single hang - since adding vsync support for Xv. And one of the tests for the ddx now includes testing vsync on each pipe. Which is more likely to mean that display configuration plays a role, or just that the sample size is too small.)
Comment 30 Ilia Mirkin 2014-02-12 17:38:14 UTC
BTW, one thing that just occurred to me, at least perhaps why chrome (+ i965_dri) triggers this while other things don't -- I believe it uses GLX_OML_sync_control, which potentially can also trigger waits. I have yet to track down where/how that's implemented, but thought I'd mention it in case it rings a bell.
Comment 31 Ilia Mirkin 2014-02-24 05:22:42 UTC
Is there any suggested workaround? You mentioned something about compositing -- would running xcompmgr "fix" it? So far, chrome + llvmpipe = rock solid, chrome + i965 = hangs. Chrome most likely does different things when it's using i965 (like that OML_sync_control thing).

Is it worth trying to not do the scanline waits but instead wait for the whole frame, or is that a big power saving?
Comment 32 Chris Wilson 2014-02-24 08:19:00 UTC
Two workarounds:

1. Disable vsync entirely (such as using llvmpipe or xcompmgr).
Option "VSync" "false"

2. Replace the vsync waits with pageflips (such as using unity).
Option "TearFree" "true"
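
For reference, a sketch of where such an option line would typically go. The file name and the Identifier string are assumptions chosen for illustration; only the Option lines come from the workarounds above, and only one of the two would normally be enabled.

```
# e.g. /etc/X11/xorg.conf.d/20-intel.conf (path is an assumption)
Section "Device"
    Identifier "Intel Graphics"
    Driver     "intel"
    # Workaround 2: replace the vsync waits with pageflips
    Option     "TearFree" "true"
    # Workaround 1 (alternative): disable vsync entirely
    # Option   "VSync" "false"
EndSection
```

After restarting X, Xorg.0.log should report whether the option was picked up by the intel driver.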
Comment 33 Chris Wilson 2014-12-10 11:57:12 UTC
Spotted a different hw w/a mentioned in bspec:

commit d247cb7d0cdb73736f31612157e47f166af68ba0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Dec 8 10:07:25 2014 +0000

    sna/gen6: Poke PSMI control around WAIT_FOR_EVENT to prevent idling
    
    The bspec recommends preventing the hardware from going to sleep around
    a WAIT_FOR_EVENT, and tells us to use disable sleep bit in PSMI control
    to accomplish this.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=62373
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>


Hopefully that does the trick.
Comment 34 Martin Jørgensen 2014-12-10 16:10:53 UTC
I'll test latest version (2.99.916-172-g04a09d3) on current Debian Jessie on HSW hardware.

I've tried setting the --enable-tear-free=true argument, but according to my Xorg.0.log it's disabled:

[    15.610] (==) intel(0): TearFree disabled

Is it forced off or just disabled by default on HSW?

