Bug 28788 - [i945 page flipping] GPU hang on 2.6.34-45 32-bit PAE kernel with GL compositor
Status: RESOLVED FIXED
Alias: None
Product: xorg
Classification: Unclassified
Component: Driver/intel
Version: unspecified
Hardware: x86 (IA32) Linux (All)
Importance: medium normal
Assignee: Jesse Barnes
QA Contact: Xorg Project Team
Reported: 2010-06-28 02:34 UTC by Simon Farnsworth
Modified: 2010-07-08 10:08 UTC

Attachments
dmesg from the failed unit. (35.34 KB, text/plain)
2010-06-28 02:35 UTC, Simon Farnsworth
Xorg.0.log from the failed machine (28.95 KB, text/plain)
2010-06-28 02:35 UTC, Simon Farnsworth
intel_error_dump output (100.23 KB, application/x-gzip)
2010-06-28 03:10 UTC, Simon Farnsworth
intel_error_dump output (100.63 KB, application/x-gzip)
2010-06-28 06:09 UTC, Simon Farnsworth
Xorg.0.log from the failed machine (28.95 KB, text/plain)
2010-06-28 06:10 UTC, Simon Farnsworth
Gzipped output from intel_gpu_dump (80.09 KB, application/x-gzip)
2010-06-29 01:45 UTC, Simon Farnsworth
New error state without the patches Chris pointed at (102.70 KB, application/x-gzip)
2010-06-29 10:51 UTC, Simon Farnsworth

Description Simon Farnsworth 2010-06-28 02:34:31 UTC
I'm trying to get page flipping working on all my Intel hardware, and I'm hitting a 100% reproducible hang on my 945GME (Intel D945GSEJT motherboard, Atom N270 CPU), using a home-grown OpenGL compositor.

I'm using:
 * Fedora kernel 2.6.34-45.fc14.i686.PAE (i686 architecture)
 * xf86-video-intel as of git 28c0ca676c47e7e38fabdd9ef24a70bd26701f33
 * xserver as of git 3b3c77b87070ddcdbb2acb114a81628485e7a129
 * mesa as of git 7a9246c5d72290ed8455a426801b85b54374e102
 * libdrm as of git 726210f87d558d558022f35bc8c839e798a19f0c

The rest of the system is stock Fedora 13. It's not affected by whether I use VGA or DVI - either way, I see a corrupt display (white has gone to black, two frames rendered on top of each other), and the logs tell me that the GPU has hung. I'm going to attach output of intel_error_dump, dmesg and the Xorg log - if there's missing information, let me know.

I'm quite happy to try patches - I can rebuild the Fedora kernel with patches as needed, and can obviously try other boot options if they'll get you a more informative trace.
Comment 1 Simon Farnsworth 2010-06-28 02:35:05 UTC
Created attachment 36565
dmesg from the failed unit.
Comment 2 Simon Farnsworth 2010-06-28 02:35:34 UTC
Created attachment 36566
Xorg.0.log from the failed machine
Comment 3 Chris Wilson 2010-06-28 03:04:34 UTC
You need to check /sys/kernel/debug/dri/0/i915_error_state as well. My guess is that you've hit one of the i945 page-flipping bugs still lurking in 2.6.34.

Try:

https://bugs.freedesktop.org/attachment.cgi?id=36463
https://bugs.freedesktop.org/attachment.cgi?id=36464
Comment 4 Simon Farnsworth 2010-06-28 03:10:42 UTC
Created attachment 36567
intel_error_dump output

Looks like if you attach a large attachment as part of the original submission, Bugzilla loses it silently. Reattaching gzip'd version of intel_error_dump output.
Comment 5 Simon Farnsworth 2010-06-28 03:12:01 UTC
And I missed Chris's comments in-flight - I'll try both those patches together and report back.
Comment 6 Chris Wilson 2010-06-28 05:20:38 UTC
The batch buffer dump doesn't correspond to page-flip waits. The only striking error in the dump (consisting of just two ops...) is the DRAWING_RECT off-by-one. So I would update mesa first.
Comment 7 Simon Farnsworth 2010-06-28 06:09:10 UTC
Created attachment 36575
intel_error_dump output

I've added Chris's recommended kernel patches to the kernel, and updated Mesa to ce7a70b8b48a4dded9b1e29590b5101dacd56e0b. I'm still seeing a GPU hang in dmesg - attaching intel_error_dump output again.
Comment 8 Simon Farnsworth 2010-06-28 06:10:05 UTC
Created attachment 36576
Xorg.0.log from the failed machine

And new xserver log from the same failure.
Comment 9 Chris Wilson 2010-06-28 07:50:43 UTC
Right, that is just a single copy from a 1200x1920 buffer to a 1920x1200 one. There is no obvious reason for failure, and it is waiting for the GPU to finish executing those 2 triangles. This is an instance where it would be useful to check the vertex data...
Comment 10 Simon Farnsworth 2010-06-28 08:13:49 UTC
Not sure what you mean by "check the vertex data" - is there something I can do to a hung process to dig it out (I've got debug symbols, and know how to drive GDB if it's something I can dig out of Mesa's data structures)?

The compositor is aiming to rotate the screen by 90° during the compositing process - the background image drawing part is what's hanging, and that appears to be slightly buggy (in that it's scaling the background image rather than rotating it).

Roughly outlined, the GL code that's hanging does:

/* during initialisation */
if( XGetWindowProperty( display, RootWindow( display, screen ),
                        XInternAtom( display, "_XROOTPMAP_ID", False ),
                        0, 4, False, AnyPropertyType,
                        &actual_type, &actual_format,
                        &nitems, &bytes_after, &prop) == Success &&
    actual_type == XInternAtom( display, "PIXMAP", False ) &&
    actual_format == 32 &&
    nitems == 1 )
{
    memcpy( &background, prop, 4 );
}
XFree( prop );

XImage *background_image = NULL;
if( background != None )
{
     background_image = XGetImage( display,
                                   background,
                                   0, 0,
                                   1200, 1920,
                                   AllPlanes,
                                   ZPixmap );
}

if( background == None || background_image == NULL )
{
    render_background = false;
    return;
}
glBindTexture( GL_TEXTURE_2D, texture );
glTexImage2D( GL_TEXTURE_2D, 0, GL_RGB,
              1200, 1920, 0,
              GL_BGRA,
              GL_UNSIGNED_BYTE, background_image->data );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST );

const GLfloat vertexes[] = { 0.0f, 1.0f,
                             0.0f, 0.0f,
                             1.0f, 1.0f,
                             1.0f, 0.0f };
const GLfloat texcoords[] = { 0.0f, 1.0f,
                              0.0f, 0.0f,
                              1.0f, 1.0f,
                              1.0f, 0.0f };

glGenBuffersARB( 2, buffers );

glBindBufferARB( GL_ARRAY_BUFFER, buffers[0] );
glBufferDataARB( GL_ARRAY_BUFFER, sizeof( GLfloat ) * 8, vertexes, GL_STATIC_DRAW );

glBindBufferARB( GL_ARRAY_BUFFER, buffers[1] );
glBufferDataARB( GL_ARRAY_BUFFER, sizeof( GLfloat ) * 8, texcoords, GL_STATIC_DRAW );

/* at time of hang */
glColor4f( 1.0, 1.0, 1.0, 1.0 );

glBindTexture( GL_TEXTURE_2D, texture );
glTexEnvi( GL_TEXTURE_ENV, GL_TEXTURE_ENV_MODE, GL_REPLACE );
glBindBufferARB( GL_ARRAY_BUFFER, buffers[0] );
glVertexPointer( 2, GL_FLOAT, sizeof(GLfloat) * 2, 0 );
glBindBufferARB( GL_ARRAY_BUFFER, buffers[1] );
glTexCoordPointer( 2, GL_FLOAT, sizeof(GLfloat) * 2, 0 );
glDrawArrays( GL_TRIANGLE_STRIP, 0, 4 );

My bug is that it's not rotating the vertexes array to handle the rotated screen - but I don't think this should cause a GPU hang.
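For reference, the missing rotation could be applied to the coordinate array on the CPU before it is uploaded. The following is only a sketch under assumptions, not the compositor's actual code: `rotate_texcoords_90` is a hypothetical helper, and the direction (counter-clockwise about the centre of the unit square) is chosen arbitrarily.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rotate 'count' interleaved (u, v) pairs by 90 degrees
 * counter-clockwise about the centre of the unit square, in place.
 * Rotating (u - 0.5, v - 0.5) and shifting back gives (u, v) -> (1 - v, u). */
void rotate_texcoords_90(float *coords, size_t count)
{
    size_t i;

    assert(coords != NULL);
    for (i = 0; i < count; i++) {
        float u = coords[2 * i];
        float v = coords[2 * i + 1];

        coords[2 * i]     = 1.0f - v;
        coords[2 * i + 1] = u;
    }
}
```

Calling `rotate_texcoords_90(texcoords, 4)` on a non-const copy of the array before the `glBufferDataARB` upload would be one way to use it. Whether the vertexes or the texcoords should be rotated depends on how the rest of the compositor sets up its projection.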
Comment 11 Simon Farnsworth 2010-06-28 10:13:07 UTC
I just power failed the unit under test (on a hunch), and I'm getting a different failure state. Instead of a nice, clean GPU hang, I'm seeing rendering stalled waiting for a reply to DRI2GetBuffersWithFormat - and my frame hasn't completed rendering on screen.

This happens on the first frame I try to render; I'm not sure what state I can dump that will help with debugging.

If I restart X11, I get the same hang as documented already.

I've also not been able to work out which magic INTEL_DEBUG option would cause vertex data to get dumped - it appears that i945 uses the generic Mesa software TnL pipeline, and doesn't provide an option to dump the final transformed vertex data.
Comment 12 Simon Farnsworth 2010-06-29 01:44:02 UTC
intel_gpu_dump tells me that ACTHD is stuck at 0x398, which is an MI_NOOP in the ringbuffer. cat /proc/interrupts shows that I'm still getting interrupts, as does /sys/kernel/debug/dri/0/i915_gem_interrupt. Oddly, neither current sequence nor IRQ sequence in i915_gem_interrupt have changed overnight:

Interrupt enable:    00028c53
Interrupt identity:  00000000
Interrupt mask:      fffd73ae
Pipe A stat:         00020200
Pipe B stat:         00000000
Interrupts received: 3352510
Current sequence:    26
Waiter sequence:     0
IRQ sequence:        0

I'm therefore not sure whether the hang is purely CPU-side, or a mix of CPU-side and GPU-side; my understanding is that every so often, the GPU is supposed to execute an MI_STORE_DATA_INDEX that updates the CPU's idea of where the GPU is in the command stream, then an MI_USER_INTERRUPT to get the CPU to check.
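The CPU-side half of that handshake amounts to comparing the sequence number the GPU last stored against the one being waited for. A minimal sketch, modelled on the wrap-safe comparison the i915 driver uses for this purpose; `request_completed` and the "0 means no request" convention are assumptions added for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Wrap-safe "has seq1 reached seq2?" test: the unsigned difference is
 * reinterpreted as signed, so the answer stays correct after the 32-bit
 * counter wraps around. */
bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) >= 0;
}

/* Hypothetical wrapper: 'hw_seqno' is the value the GPU last wrote with
 * MI_STORE_DATA_INDEX, 'wanted' is the request the CPU is waiting on. */
bool request_completed(uint32_t hw_seqno, uint32_t wanted)
{
    assert(wanted != 0); /* assumption: 0 is reserved for "no request" */
    return seqno_passed(hw_seqno, wanted);
}
```

With the values shown above (current sequence 26, waiter sequence 0), nothing is waiting on a sequence number at all, which would point away from the ring itself as the place where things are stuck.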

The intel_gpu_dump header looks sane:
ACTHD: 0x00000398
EIR: 0x00000000
EMR: 0xffffffed
ESR: 0x00000000
PGTBL_ER: 0x00000000
IPEHR: 0x01000000
IPEIR: 0x00000000
INSTDONE: 0x7fffffc0
Ringbuffer: Reminder: head pointer is GPU read, tail pointer is CPU write
ringbuffer at 0x00000000:

The ringbuffer has a fairly regular pattern at the moment, in the bit preceding the HEAD pointer:

0x00000000:      0x02000000: MI_FLUSH
0x00000004:      0x00000000: MI_NOOP
0x00000008:      0x18800080: MI_BATCH_BUFFER_START
0x0000000c:      0x010ac001:    dword 1
0x00000010:      0x02000004: MI_FLUSH
0x00000014:      0x00000000: MI_NOOP
0x00000018:      0x10800001: MI_STORE_DATA_INDEX
0x0000001c:      0x00000080:    dword 1
0x00000020:      0x00000001:    dword 2
0x00000024:      0x01000000: MI_USER_INTERRUPT
0x00000028:      0x02000004: MI_FLUSH
0x0000002c:      0x00000000: MI_NOOP
0x00000030:      0x18800080: MI_BATCH_BUFFER_START
0x00000034:      0x010b2001:    dword 1
0x00000038:      0x02000004: MI_FLUSH
0x0000003c:      0x00000000: MI_NOOP
0x00000040:      0x10800001: MI_STORE_DATA_INDEX
0x00000044:      0x00000080:    dword 1
0x00000048:      0x00000002:    dword 2
0x0000004c:      0x01000000: MI_USER_INTERRUPT

repeating with different batch buffers until:

0x00000340:      0x02000000: MI_FLUSH
0x00000344:      0x00000000: MI_NOOP
0x00000348:      0x10800001: MI_STORE_DATA_INDEX
0x0000034c:      0x00000080:    dword 1
0x00000350:      0x00000018:    dword 2
0x00000354:      0x01000000: MI_USER_INTERRUPT
0x00000358:      0x02000000: MI_FLUSH
0x0000035c:      0x00000000: MI_NOOP
0x00000360:      0x18800080: MI_BATCH_BUFFER_START
0x00000364:      0x010ac001:    dword 1
0x00000368:      0x02000004: MI_FLUSH
0x0000036c:      0x00000000: MI_NOOP
0x00000370:      0x10800001: MI_STORE_DATA_INDEX
0x00000374:      0x00000080:    dword 1
0x00000378:      0x00000019:    dword 2
0x0000037c:      0x01000000: MI_USER_INTERRUPT
0x00000380:      0x02000000: MI_FLUSH
0x00000384:      0x00000000: MI_NOOP
0x00000388:      0x10800001: MI_STORE_DATA_INDEX
0x0000038c:      0x00000080:    dword 1
0x00000390:      0x0000001a:    dword 2
0x00000394:      0x01000000: MI_USER_INTERRUPT
0x00000398: HEAD 0x00000000: MI_NOOP

I will attach the output of intel_gpu_dump, in case it triggers memories in someone.
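The dump can be cross-checked by walking the ring and picking out the last sequence number stored before HEAD. The sketch below only knows the five MI opcodes that appear in this dump, and the command lengths are taken from the dump itself rather than from hardware documentation; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* For MI commands the client field (bits 31:29) is 0 and the opcode
 * sits in bits 28:23 of the first dword. */
#define MI_CLIENT(dw) ((dw) >> 29)
#define MI_OPCODE(dw) (((dw) >> 23) & 0x3f)

/* Name the MI opcodes seen in the dump; anything else is "?". */
const char *mi_name(uint32_t dw)
{
    if (MI_CLIENT(dw) != 0)
        return "?";
    switch (MI_OPCODE(dw)) {
    case 0x00: return "MI_NOOP";
    case 0x02: return "MI_USER_INTERRUPT";
    case 0x04: return "MI_FLUSH";
    case 0x21: return "MI_STORE_DATA_INDEX";
    case 0x31: return "MI_BATCH_BUFFER_START";
    default:   return "?";
    }
}

/* Command length in dwords, as observed in the dump. */
size_t mi_len(uint32_t dw)
{
    switch (MI_OPCODE(dw)) {
    case 0x21: return 3; /* header, index, value */
    case 0x31: return 2; /* header, batch address */
    default:   return 1;
    }
}

/* Walk the ring from offset 0 up to HEAD (in dwords) and return the value
 * of the last MI_STORE_DATA_INDEX found, or 0 if there is none. */
uint32_t last_stored_seqno(const uint32_t *ring, size_t head_dwords)
{
    uint32_t seqno = 0;
    size_t i = 0;

    assert(ring != NULL);
    while (i < head_dwords) {
        if (MI_OPCODE(ring[i]) == 0x21 && i + 2 < head_dwords)
            seqno = ring[i + 2];
        i += mi_len(ring[i]);
    }
    return seqno;
}
```

Applied to the dwords above, the last value stored before HEAD is 0x1a, i.e. 26, which matches the "Current sequence: 26" line from i915_gem_interrupt.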
Comment 13 Simon Farnsworth 2010-06-29 01:45:02 UTC
Created attachment 36594
Gzipped output from intel_gpu_dump
Comment 14 Chris Wilson 2010-06-29 02:27:41 UTC
That's normal behaviour of a mostly idle GPU.
Comment 15 Simon Farnsworth 2010-06-29 02:49:28 UTC
So it looks like X isn't responding to me, because it's waiting in the kernel:

(gdb) bt
#0  0x00472424 in __kernel_vsyscall ()
#1  0x009ce1f9 in ioctl () from /lib/libc.so.6
#2  0x00152d8f in drm_intel_gem_bo_mrb_exec2 (bo=0x8b17648, used=264, cliprects=0x0, num_cliprects=0, DR4=-1, ring_flag=1) at intel_bufmgr_gem.c:1608
#3  0x00152fb5 in drm_intel_gem_bo_exec2 (bo=0x8b17648, used=264, cliprects=0x0, num_cliprects=0, DR4=-1) at intel_bufmgr_gem.c:1649
#4  0x0014e59e in drm_intel_bo_exec (bo=0x8b17648, used=264, cliprects=0x0, num_cliprects=0, DR4=-1) at intel_bufmgr.c:145
#5  0x00259a1b in intel_batch_submit (scrn=0x8958050, flush=1) at intel_batchbuffer.c:194
#6  0x002585a5 in I830BlockHandler (i=0, blockData=0x0, pTimeout=0xbfaab5bc, pReadmask=0x81fbe80) at intel_driver.c:704
#7  0x0810f4fb in AnimCurScreenBlockHandler (screenNum=0, blockData=0x0, pTimeout=0xbfaab5bc, pReadmask=0x81fbe80) at animcur.c:194
#8  0x0817f18e in compBlockHandler (i=0, blockData=0x0, pTimeout=0xbfaab5bc, pReadmask=0x81fbe80) at compinit.c:157
#9  0x08062a28 in BlockHandler (pTimeout=0xbfaab5bc, pReadmask=0x81fbe80) at dixutils.c:385
#10 0x080a0e8c in WaitForSomething (pClientsReady=0x8af2208) at WaitFor.c:216
#11 0x0808685e in Dispatch () at dispatch.c:368
#12 0x08062515 in main (argc=15, argv=0xbfaab724, envp=0xbfaab764) at main.c:289

Time to dig and find out what X is doing here.
Comment 16 Simon Farnsworth 2010-06-29 05:19:20 UTC
This looks to be pageflipping related. I did "echo t > /proc/sysrq-trigger", to get the following call trace:

Xorg          S 00000015     0  1380      1 0x00400000
 f69fddc0 00203086 64aadc9c 00000015 c0a4fd40 c0a4fd40 c0a4fd40 c0a4fd40
 f5cfa8ec c0a4fd40 c0a4fd40 00034a30 00000000 f6b04800 00000015 f5cfa640
 00000000 f5cfa640 f69fde20 f72b70a4 f69fde40 f80101b3 00203246 80000000
Call Trace:
 [<f80101b3>] i915_gem_do_execbuffer+0x378/0xbf8 [i915]
 [<f800bce2>] ? list_move_tail+0x18/0x1b [i915]
 [<c04c8f62>] ? __kmalloc+0xfc/0x108
 [<c045212d>] ? autoremove_wake_function+0x0/0x2f
 [<f8010acf>] i915_gem_execbuffer2+0x9c/0xe2 [i915]
 [<f7f85a8c>] drm_ioctl+0x237/0x317 [drm]
 [<f8010a33>] ? i915_gem_execbuffer2+0x0/0xe2 [i915]
 [<c04d198a>] ? fsnotify_modify+0x4f/0x5a
 [<c04dc1dd>] vfs_ioctl+0x27/0x91
 [<f7f85855>] ? drm_ioctl+0x0/0x317 [drm]
 [<c04dc77e>] do_vfs_ioctl+0x48e/0x4cc
 [<c0431dcc>] ? pick_next_task_fair+0xb3/0xbb
 [<c0431df1>] ? pick_next_task+0x1d/0x34
 [<c0786093>] ? schedule+0x585/0x5d9
 [<c04dc7fd>] sys_ioctl+0x41/0x61
 [<c040885f>] sysenter_do_call+0x12/0x28
 [<c0780000>] ? init_intel+0x140/0x355

Disassembling i915.ko in gdb shows me that i915_gem_do_execbuffer+0x378 is in fact part of i915_gem_wait_for_pending_flip, line 3638 (or thereabouts), just after the mutex_lock(&dev->struct_mutex) in:
static int
i915_gem_wait_for_pending_flip(struct drm_device *dev,
                               struct drm_gem_object **object_list,
                               int count)
{
        drm_i915_private_t *dev_priv = dev->dev_private;
        struct drm_i915_gem_object *obj_priv;
        DEFINE_WAIT(wait);
        int i, ret = 0;

        for (;;) {
                prepare_to_wait(&dev_priv->pending_flip_queue,
                                &wait, TASK_INTERRUPTIBLE);
                for (i = 0; i < count; i++) {
                        obj_priv = to_intel_bo(object_list[i]);
                        if (atomic_read(&obj_priv->pending_flip) > 0)
                                break;
                }
                if (i == count)
                        break;

                if (!signal_pending(current)) {
                        mutex_unlock(&dev->struct_mutex);
                        schedule();
                        mutex_lock(&dev->struct_mutex);
                        continue;
                }
                ret = -ERESTARTSYS;
                break;
        }
        finish_wait(&dev_priv->pending_flip_queue, &wait);

        return ret;
}
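The exit condition of that loop can be isolated into a toy predicate. This is a userspace sketch with plain integers standing in for the atomic `pending_flip` counters; the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Mirror of the inner scan: index of the first object that still has a
 * flip pending, or 'count' if none does. */
size_t first_pending_flip(const int *pending_flip, size_t count)
{
    size_t i;

    assert(pending_flip != NULL || count == 0);
    for (i = 0; i < count; i++)
        if (pending_flip[i] > 0)
            break;
    return i;
}

/* Without a signal, the wait loop only exits when this returns nonzero. */
int can_stop_waiting(const int *pending_flip, size_t count)
{
    return first_pending_flip(pending_flip, count) == count;
}
```

A single counter that is never decremented is therefore enough to keep the caller asleep until a signal arrives.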

I'm getting stuck here - any suggestions will be welcomed.
Comment 17 Chris Wilson 2010-06-29 06:42:04 UTC
Assigning to Jesse as he lives for the thrill of broken page flip on i945. At the least he may have some additional patches in his tree for this issue.
Comment 18 Jesse Barnes 2010-06-29 09:39:45 UTC
That trace helps.

One of your processes is waiting for flip completion on a buffer that was just queued.  Which means we never decremented the pending_flip count for the buffer, which means one of several things:
  - failed to prepare the flip which would keep the pending bit from getting set, so intel_finish_page_flip() would never decrement it (no flip pending interrupt?)
  - failed to finish page flip (no vblank interrupt?)
  - failed to wake up the pending flip queue (somehow)

In the drm repo there's a test called vbltest, can you run that (you may need to pass -s depending on your output config) and see if it returns a frequency approximately equal to the display's refresh rate?  If not, there's something wrong with vblank interrupts on your platform that could cause problems.

Assuming that works, can you try the modetest program?  It has a -v flag that lets you check page flipping basics.  If that fails it may be easier to trace than a full stack with your compositor.

If both of those seem ok then we're failing somewhere else.  Tracing the failure points above may shed some light on things...
Comment 19 Simon Farnsworth 2010-06-29 09:59:28 UTC
With X still running, but the world in the failed state, vbltest shows the correct frequency (59.80Hz). modetest -c shows me that the DVI-D connector (the one I'm using) is id 8. modetest -v -s 8:1920x1200 gives me:

trying to load module i915...success.
setting mode 1920x1200 on connector 8, crtc 3
select timed out or error (ret 0)

The select line repeats until I terminate modetest. At the same time, I have a nice colourful picture on screen.

Once I've run modetest, vbltest stops working, and gives output:

trying to load module i915...success.
starting count: 0
select timed out or error (ret 0)

Again, the select line repeats until I terminate it.

After a power failure, without letting X or my OpenGL compositor run, vbltest works, and shows the correct frequency. When I run modetest, I see the colourful picture briefly, then it flips to a grey screen and stalls. I see the same output as I did after failure. Again, vbltest stops working at this point.

I should add that when vbltest works, I get output like:
trying to load module i915...success.
starting count: 8063
freq: 60.04Hz
freq: 59.80Hz

The second freq: line repeats until I terminate vbltest, and the value of the first freq: line is always slightly different, although still around 60Hz. In addition, the starting count when vbltest works appears to vary in line with system uptime (as you would expect), whereas it's always 0 when it fails.
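For anyone reading those numbers: a frequency like this is simply a vblank count difference divided by elapsed time. The helper below is a hypothetical illustration of that arithmetic, not vbltest's actual source.

```c
#include <assert.h>

/* Hypothetical helper: average vblank rate in Hz between two samples.
 * 'count0' and 'count1' are vblank counter readings, 'elapsed_us' is the
 * time between them in microseconds. */
double vblank_freq_hz(unsigned int count0, unsigned int count1,
                      long long elapsed_us)
{
    assert(elapsed_us > 0);
    /* Unsigned subtraction stays correct if the counter wraps once. */
    return (double)(count1 - count0) * 1e6 / (double)elapsed_us;
}
```

In the failure state no vblank events arrive at all, so there is nothing to divide and the tool reports select timeouts instead.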
Comment 20 Simon Farnsworth 2010-06-29 10:08:09 UTC
I've just checked vbltest -s for sanity's sake, and that behaves identically to vbltest once I'm in the failure state.
Comment 21 Jesse Barnes 2010-06-29 10:19:14 UTC
(In reply to comment #19)
> With X still running, but the world in the failed state, vbltest shows the
> correct frequency (59.80Hz). modetest -c shows me that the DVI-D connector (the
> one I'm using) is id 8. modetest -v -s 8:1920x1200 gives me:
> 
> trying to load module i915...success.
> setting mode 1920x1200 on connector 8, crtc 3
> select timed out or error (ret 0)
> 
> The select line repeats until I terminate modetest. At the same time, I have a
> nice colourful picture on screen.

Ok, so that means vblank interrupts work ok until you try to flip, then interrupts break altogether when we try to queue a flip.  If modetest were working, you should see the nice screen alternate with a grey buffer making it look faded out if the flips are occurring at the right frequency.

Did you run your tests with both the kernel patches Chris pointed you at applied?  Is the behavior the same without them?
Comment 22 Simon Farnsworth 2010-06-29 10:50:51 UTC
All tests are currently being run with the patches Chris pointed out in use.
modetest functions correctly if I remove those two patches, and just use the
vanilla kernel, but I get a different failure out of X11. I'll attach a new
intel_error_dump from the new failure state.

XServer log ends with:
[   138.045] (EE) intel(0): Detected a hung GPU, disabling acceleration.
[   138.116] 
Backtrace:
[   138.116] 0: /usr/local/x11test/bin/Xorg (xorg_backtrace+0x3b) [0x80a05fb]
[   138.117] 1: /usr/local/x11test/bin/Xorg (0x8048000+0x54fe5) [0x809cfe5]
[   138.117] 2: (vdso) (__kernel_rt_sigreturn+0x0) [0xc8540c]
[   138.117] 3: /lib/libc.so.6 (__libc_malloc+0x5e) [0x44805e]
[   138.117] 4: /usr/local/x11test/bin/Xorg (AddResource+0x6f) [0x8088c7f]
[   138.117] 5: /usr/local/x11test/lib/xorg/modules/extensions/libglx.so (0xcc2000+0x315c3) [0xcf35c3]
[   138.117] 6: /usr/local/x11test/lib/xorg/modules/extensions/libglx.so (0xcc2000+0x33237) [0xcf5237]
[   138.117] 7: /usr/local/x11test/lib/xorg/modules/extensions/libglx.so (0xcc2000+0x33362) [0xcf5362]
[   138.117] 8: /usr/local/x11test/lib/xorg/modules/extensions/libglx.so (0xcc2000+0x36392) [0xcf8392]
[   138.117] 9: /usr/local/x11test/bin/Xorg (0x8048000+0x3eba7) [0x8086ba7]
[   138.118] 10: /usr/local/x11test/bin/Xorg (0x8048000+0x1a515) [0x8062515]
[   138.118] 11: /lib/libc.so.6 (__libc_start_main+0xe6) [0x3ebcc6]
[   138.118] 12: /usr/local/x11test/bin/Xorg (0x8048000+0x1a0f1) [0x80620f1]
[   138.118] Segmentation fault at address 0x85a79
[   138.118] 
Fatal server error:
[   138.118] Caught signal 11 (Segmentation fault). Server aborting
[   138.119] 
[   138.119] 

I've now spent far too long at work for one day, so I'm going to go quiet for
the next 14 hours or so - I'll continue working on this at around 10am BST, so
anything you come up with in the meantime will get tested.
Comment 23 Simon Farnsworth 2010-06-29 10:51:30 UTC
Created attachment 36610
New error state without the patches Chris pointed at
Comment 24 Jesse Barnes 2010-06-29 10:58:29 UTC
Oh, it's also possible we're hanging in the kernel somewhere with interrupts disabled, causing subsequent flip or vblank requests to hang.  Can you check /proc/<pid>/wchan in the failure case as well (or use echo t > /proc/sysrq-trigger like you did before)?
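If it is easier to script than to cat by hand, reading the wait channel is just a small file read. A sketch assuming a kernel that exposes /proc/<pid>/wchan; `wchan_path` and `read_wchan` are hypothetical helper names.

```c
#include <assert.h>
#include <stdio.h>

/* Build "/proc/<pid>/wchan" into 'buf'.
 * Returns 0 on success, -1 if the buffer is too small. */
int wchan_path(long pid, char *buf, size_t len)
{
    int n;

    assert(buf != NULL && pid > 0);
    n = snprintf(buf, len, "/proc/%ld/wchan", pid);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Read the wait channel of 'pid' into 'out' as a NUL-terminated string.
 * Returns 0 on success, -1 if the file cannot be opened. */
int read_wchan(long pid, char *out, size_t len)
{
    char path[64];
    FILE *f;
    size_t n;

    if (out == NULL || len == 0 || wchan_path(pid, path, sizeof(path)) != 0)
        return -1;
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    n = fread(out, 1, len - 1, f);
    out[n] = '\0';
    fclose(f);
    return 0;
}
```

For the Xorg process in the earlier sysrq trace, the pid to pass would be 1380.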

Out of paranoia, you could also try this:

diff --git a/drivers/gpu/drm/i915/intel_display.c b/drivers/gpu/drm/i915/intel_d
index cc8131f..2bfb2b1 100644
--- a/drivers/gpu/drm/i915/intel_display.c
+++ b/drivers/gpu/drm/i915/intel_display.c
@@ -4731,7 +4731,11 @@ static int intel_crtc_page_flip(struct drm_crtc *crtc,
        atomic_inc(&obj_priv->pending_flip);
        work->pending_flip_obj = obj;
 
-       BEGIN_LP_RING(4);
+       BEGIN_LP_RING(8);
+       OUT_RING(MI_FLUSH);
+       OUT_RING(MI_FLUSH);
+       OUT_RING(MI_FLUSH);
+       OUT_RING(MI_FLUSH);
        OUT_RING(MI_DISPLAY_FLIP |
                 MI_DISPLAY_FLIP_PLANE(intel_crtc->plane));
        OUT_RING(fb->pitch);
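
Note on the count change in the patch above: the reservation has to match the number of dwords actually emitted, so the original 4 plus four MI_FLUSH dwords gives 8. The toy model below illustrates that bookkeeping only; it is not the i915 ring code, and every name in it is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RING_DWORDS 64

struct toy_ring {
    uint32_t buf[TOY_RING_DWORDS];
    unsigned int tail;     /* next dword to write */
    unsigned int reserved; /* dwords still owed to the current begin */
};

/* Reserve room for exactly n dwords, in the spirit of BEGIN_LP_RING(n). */
void toy_ring_begin(struct toy_ring *r, unsigned int n)
{
    assert(r->reserved == 0);
    assert(r->tail + n <= TOY_RING_DWORDS); /* toy: no wrap handling */
    r->reserved = n;
}

/* Emit one dword, in the spirit of OUT_RING(x). */
void toy_ring_out(struct toy_ring *r, uint32_t dw)
{
    assert(r->reserved > 0); /* emitting more than was reserved is a bug */
    r->buf[r->tail++] = dw;
    r->reserved--;
}

/* Returns 1 if every reserved dword was emitted, 0 otherwise. */
int toy_ring_balanced(const struct toy_ring *r)
{
    return r->reserved == 0;
}
```

Reserving 8 and emitting four flushes plus the four flip dwords leaves the model balanced; reserving 4 and emitting 8 would trip the assertion in `toy_ring_out`.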
Comment 25 Chris Wilson 2010-06-29 12:06:47 UTC
I would also include:

https://bugs.freedesktop.org/attachment.cgi?id=35551

in your kernel patchset as that should reduce the number of spurious hangs.
Comment 26 Simon Farnsworth 2010-06-30 01:43:16 UTC
(In reply to comment #24)
> Oh, it's also possible we're hanging in the kernel somewhere with interrupts
> disabled, causing subsequent flip or vblank requests to hang.  Can you check
> /proc/<pid>/wchan in the failure case as well (or use echo t >
> /proc/sysrq-trigger like you did before)?
> 
In the new X failure case (without the patches that Chris pointed out), X dies due to acceleration failure - so I don't have a wchan to chase.

I'm adding the third patch Chris pointed out in this bug (on top of the original 2 that fail), and I'll add your paranoia patch to xf86-video-intel.
Comment 27 Simon Farnsworth 2010-06-30 02:11:24 UTC
Remind me not to try and do things before coffee - I'll add your patch to the *kernel*.
Comment 28 Simon Farnsworth 2010-06-30 02:20:51 UTC
Your patch conflicts with the second patch Chris pointed at me: https://bugs.freedesktop.org/attachment.cgi?id=36464

Not sure how best to proceed - do I modify your patch to apply on top of 36464, or do I drop the three patches Chris pointed out?
Comment 29 Simon Farnsworth 2010-06-30 02:32:05 UTC
On Chris's advice from IRC, I've rebased your suggestion as:
--- intel_display.c.orig        2010-06-30 10:22:40.000000000 +0100
+++ intel_display.c     2010-06-30 10:30:46.274401149 +0100
@@ -4756,7 +4756,11 @@ static int intel_crtc_page_flip(struct d
                while (I915_READ(ISR) & flip_mask)
                        ;
 
-       BEGIN_LP_RING(4);
+       BEGIN_LP_RING(8);
+       OUT_RING(MI_FLUSH);
+       OUT_RING(MI_FLUSH);
+       OUT_RING(MI_FLUSH);
+       OUT_RING(MI_FLUSH);
        if (IS_I965G(dev)) {
                OUT_RING(MI_DISPLAY_FLIP |
                         MI_DISPLAY_FLIP_PLANE(intel_crtc->plane));
Comment 30 Simon Farnsworth 2010-06-30 05:11:39 UTC
Adding the extra flushes on top of the other 3 patches that Chris points out is definitely changing behaviour - instead of locking, Xorg dies, and I get the following in the server log:

[   138.045] (EE) intel(0): Detected a hung GPU, disabling acceleration.
[   138.116] 
Backtrace:
[   138.116] 0: /usr/local/x11test/bin/Xorg (xorg_backtrace+0x3b) [0x80a05fb]
[   138.117] 1: /usr/local/x11test/bin/Xorg (0x8048000+0x54fe5) [0x809cfe5]
[   138.117] 2: (vdso) (__kernel_rt_sigreturn+0x0) [0xc8540c]
[   138.117] 3: /lib/libc.so.6 (__libc_malloc+0x5e) [0x44805e]
[   138.117] 4: /usr/local/x11test/bin/Xorg (AddResource+0x6f) [0x8088c7f]
[   138.117] 5: /usr/local/x11test/lib/xorg/modules/extensions/libglx.so (0xcc2000+0x315c3) [0xcf35c3]
[   138.117] 6: /usr/local/x11test/lib/xorg/modules/extensions/libglx.so (0xcc2000+0x33237) [0xcf5237]
[   138.117] 7: /usr/local/x11test/lib/xorg/modules/extensions/libglx.so (0xcc2000+0x33362) [0xcf5362]
[   138.117] 8: /usr/local/x11test/lib/xorg/modules/extensions/libglx.so (0xcc2000+0x36392) [0xcf8392]
[   138.117] 9: /usr/local/x11test/bin/Xorg (0x8048000+0x3eba7) [0x8086ba7]
[   138.118] 10: /usr/local/x11test/bin/Xorg (0x8048000+0x1a515) [0x8062515]
[   138.118] 11: /lib/libc.so.6 (__libc_start_main+0xe6) [0x3ebcc6]
[   138.118] 12: /usr/local/x11test/bin/Xorg (0x8048000+0x1a0f1) [0x80620f1]
[   138.118] Segmentation fault at address 0x85a79
Comment 31 Chris Wilson 2010-06-30 06:04:00 UTC
(In reply to comment #30)
> Adding the extra flushes on top of the other 3 patches that Chris points out is
> definitely changing behaviour - instead of locking, Xorg dies, and I get the
> following in the server log:
> 
> [   138.045] (EE) intel(0): Detected a hung GPU, disabling acceleration.

We still have a GPU hang. I haven't yet debugged all the error paths that we subsequently hit through dri/glx, so these segfaults are an annoyance.

The difference in behaviour I guess is that the hang is detected during a flush so that we don't find ourselves with the hang racing against the flip.

Is the hang dependent upon the flip path at all? Can you reproduce the hang if you do the buffer rotation without the final swap/page flip?
Comment 32 Simon Farnsworth 2010-06-30 06:26:33 UTC
If I patch xf86-video-intel to not use pageflipping, it works, albeit not smoothly.

diff --git a/src/drmmode_display.c b/src/drmmode_display.c
index 17f6541..2f847f7 100644
--- a/src/drmmode_display.c
+++ b/src/drmmode_display.c
@@ -1453,7 +1453,7 @@ Bool drmmode_pre_init(ScrnInfoPtr scrn, int fd, int cpp)
        gp.value = &has_flipping;
        (void)drmCommandWriteRead(intel->drmSubFD, DRM_I915_GETPARAM, &gp,
                                  sizeof(gp));
-       if (has_flipping) {
+       if (has_flipping && 0) {
                xf86DrvMsg(scrn->scrnIndex, X_INFO,
                           "Kernel page flipping support detected, enabling\n");
                intel->use_pageflipping = TRUE;

is the change I made to prevent page flipping being used.

Enabling pageflipping results in the GPU hang - but no error state is collected as far as intel_error_decode is concerned.
Comment 33 Jesse Barnes 2010-06-30 12:13:34 UTC
It's better to disable flipping slightly differently:

diff --git a/src/drmmode_display.c b/src/drmmode_display.c
index d8b158e..e06a2fc 100644
--- a/src/drmmode_display.c
+++ b/src/drmmode_display.c
@@ -1464,7 +1464,7 @@ Bool drmmode_pre_init(ScrnInfoPtr scrn, int fd, int cpp)
 	if (has_flipping) {
 		xf86DrvMsg(scrn->scrnIndex, X_INFO,
 			   "Kernel page flipping support detected, enabling\n");
-		intel->use_pageflipping = TRUE;
+		intel->use_pageflipping = FALSE;
 		drmmode->flip_count = 0;
 		drmmode->event_context.version = DRM_EVENT_CONTEXT_VERSION;
 		drmmode->event_context.vblank_handler = drmmode_vblank_handler;
diff --git a/src/i830_dri.c b/src/i830_dri.c
index 321faf6..d220e3d 100644
--- a/src/i830_dri.c
+++ b/src/i830_dri.c
@@ -1013,7 +1013,7 @@ Bool I830DRI2ScreenInit(ScreenPtr screen)
 
 	info.CopyRegion = I830DRI2CopyRegion;
 #if DRI2INFOREC_VERSION >= 4
-	if (intel->use_pageflipping) {
+	if (intel->use_pageflipping || 1) {
 	    info.version = 4;
 	    info.ScheduleSwap = I830DRI2ScheduleSwap;
 	    info.GetMSC = I830DRI2GetMSC;

That way you keep the other GL features that require vblank events but disable flip ioctls.
Comment 34 Simon Farnsworth 2010-07-01 05:20:06 UTC
A discussion with Jesse on IRC resulted in him noticing that the order of the prepare_page_flip and finish_page_flip calls in https://bugs.freedesktop.org/attachment.cgi?id=36464 was wrong. I've flipped them round, and now my 945 is page flipping, although it struggles to sustain any frame rate worth noting at 1920x1200 (it's fine at 1280x720, so I'm happy to call that a 945 limit).

The faulty hunk is:

diff --git a/drivers/gpu/drm/i915/i915_irq.c b/drivers/gpu/drm/i915/i915_irq.c
index 2479be0..a846cd8 100644
--- a/drivers/gpu/drm/i915/i915_irq.c
+++ b/drivers/gpu/drm/i915/i915_irq.c
@@ -940,22 +940,30 @@ irqreturn_t i915_driver_irq_handler(DRM_IRQ_ARGS)
 		if (HAS_BSD(dev) && (iir & I915_BSD_USER_INTERRUPT))
 			DRM_WAKEUP(&dev_priv->bsd_ring.irq_queue);
 
-		if (iir & I915_DISPLAY_PLANE_A_FLIP_PENDING_INTERRUPT)
+		if (iir & I915_DISPLAY_PLANE_A_FLIP_PENDING_INTERRUPT) {
 			intel_prepare_page_flip(dev, 0);
+			if (dev_priv->flip_pending_is_done)
+				intel_finish_page_flip_plane(dev, 0);
+		}
 
-		if (iir & I915_DISPLAY_PLANE_B_FLIP_PENDING_INTERRUPT)
+		if (iir & I915_DISPLAY_PLANE_B_FLIP_PENDING_INTERRUPT) {
+			if (dev_priv->flip_pending_is_done)
+				intel_finish_page_flip_plane(dev, 1);
 			intel_prepare_page_flip(dev, 1);
+		}
 
 		if (pipea_stats & vblank_status) {
 			vblank++;
 			drm_handle_vblank(dev, 0);
-			intel_finish_page_flip(dev, 0);
+			if (!dev_priv->flip_pending_is_done)
+				intel_finish_page_flip(dev, 0);
 		}
 
 		if (pipeb_stats & vblank_status) {
 			vblank++;
 			drm_handle_vblank(dev, 1);
-			intel_finish_page_flip(dev, 1);
+			if (!dev_priv->flip_pending_is_done)
+				intel_finish_page_flip(dev, 1);
 		}
 
 		if ((pipea_stats & I915_LEGACY_BLC_EVENT_STATUS) ||

Changing it so that "finish_page_flip" calls are always after prepare_page_flip calls makes it work.
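
To spell out why the order matters: in the hunk above, the plane B branch calls intel_finish_page_flip_plane before intel_prepare_page_flip, while the plane A branch does the reverse. The toy model below rests on one stated assumption, namely that finish only completes a flip that prepare has already marked as pending. It is an illustration with hypothetical names, not the kernel code.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_flip {
    bool queued;      /* a flip was submitted and awaits completion */
    bool pending;     /* set by prepare, required by finish */
    int pending_flip; /* counter that waiters sleep on */
};

void toy_queue_flip(struct toy_flip *f)
{
    f->queued = true;
    f->pending = false;
    f->pending_flip++;
}

void toy_prepare(struct toy_flip *f)
{
    if (f->queued)
        f->pending = true;
}

void toy_finish(struct toy_flip *f)
{
    if (!f->queued || !f->pending)
        return; /* assumption: nothing prepared means nothing to finish */
    f->queued = false;
    f->pending = false;
    f->pending_flip--;
}

/* One flip-pending interrupt with the two calls in either order.
 * Returns the counter value left behind for waiters. */
int toy_irq(struct toy_flip *f, bool finish_first)
{
    if (finish_first) {
        toy_finish(f);
        toy_prepare(f);
    } else {
        toy_prepare(f);
        toy_finish(f);
    }
    return f->pending_flip;
}
```

With prepare first the counter drops back to 0 in the same interrupt. With finish first it stays at 1, which is the state a waiter in i915_gem_wait_for_pending_flip would sleep on.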
Comment 35 Jesse Barnes 2010-07-08 10:08:05 UTC
Marking as fixed per IRC comment.

