Bug 50069 - [IVB] hard hang with i-g-t/flip-test
[IVB] hard hang with i-g-t/flip-test
Status: VERIFIED FIXED
Product: DRI
Classification: Unclassified
Component: DRM/Intel
unspecified
Other All
: medium normal
Assigned To: Daniel Vetter
:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-05-17 18:40 UTC by Guang Yang
Modified: 2012-10-15 03:37 UTC (History)
5 users (show)

See Also:
i915 platform:
i915 features:


Attachments
after running flip_test dmesg info (3.68 KB, text/plain)
2012-05-17 18:40 UTC, Guang Yang
no flags Details
runnint case with drm.debug=7 's debug info (122.45 KB, text/plain)
2012-05-22 20:24 UTC, Guang Yang
no flags Details
trace-cmd report info (2.89 MB, text/plain)
2012-05-22 21:39 UTC, Guang Yang
no flags Details
Make the lock for pageflips interruptible (1.44 KB, patch)
2012-05-23 03:08 UTC, Chris Wilson
no flags Details | Splinter Review
fixup ivb plane 3 pageflips (2.92 KB, patch)
2012-05-23 05:16 UTC, Daniel Vetter
no flags Details | Splinter Review
dmesg file when running the flip_test (119.21 KB, text/plain)
2012-06-06 23:44 UTC, Yi Sun
no flags Details
zip format of perf.data (1.60 MB, application/octet-stream)
2012-06-14 20:31 UTC, Guang Yang
no flags Details
running case's decoded (1.43 MB, text/plain)
2012-06-18 02:08 UTC, Guang Yang
no flags Details

Note You need to log in before you can comment on or make changes to this bug.
Description Guang Yang 2012-05-17 18:40:36 UTC
Created attachment 61771 [details]
after running flip_test dmesg info

System Environment:
--------------------------
Platform:        Ivbridge(chief river)
Kernel: (drm-intel-next-queued)659900dd3fa07b4ddffebfff07f6ee0341988943
Bug detailed description:
-------------------------
   On IVB platform ,running flip_test of the Intel-gpu-tools will never finish.
   On the console shows this:
    1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000
        page flipping on crtc 3, connector 9: PASSED
    1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000  
   then it stays there.
   In dmesg shows:
[  191.198279] ------------[ cut here ]------------
[  191.198316] kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3019!
[  191.198357] invalid opcode: 0000 [#1] SMP 
[  191.198393] CPU 1 
[  191.198409] Modules linked in: sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf dm_mirror dm_region_hash dm_log asix usbnet snd_hda_codec_idt mii snd_hda_intel snd_hda_codec snd_hwdep snd_seq coretemp hwmon snd_seq_device microcode sg snd_pcm snd_timer snd pcspkr i2c_i801 e1000e soundcore snd_page_alloc iTCO_wdt iTCO_vendor_support battery ac ipv6 autofs4 ext3 jbd mbcache sr_mod cdrom sd_mod ehci_hcd i915 drm_kms_helper drm i2c_algo_bit button i2c_core video dm_mod [last unloaded: scsi_wait_scan]
[  191.199042] 
[  191.199076] Pid: 3366, comm: flip_test Not tainted 3.4.0-rc6_unstable_20120517+ #1 Intel Corporation 2012 Client Platform/Yellow Bluff FFVS
[  191.199256] RIP: 0010:[<ffffffffa009e35c>]  [<ffffffffa009e35c>] i915_gem_object_pin+0x28/0x129 [i915]
[  191.199413] RSP: 0018:ffff880146e57bc8  EFLAGS: 00010246
[  191.199480] RAX: 000000000001e000 RBX: ffff880132020348 RCX: ffff880132020348
[  191.199583] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880132020348
[  191.199669] RBP: ffff880146e57bf8 R08: 0000000000000001 R09: ffff880134024000
[  191.199766] R10: ffff88014e22b000 R11: ffff88014682b738 R12: ffff880134025f01
[  191.199861] R13: 0000000000000000 R14: ffff880132020348 R15: ffff880132020348
[  191.199949] FS:  00007f891689b8a0(0000) GS:ffff88014e220000(0000) knlGS:0000000000000000
[  191.200066] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[  191.200137] CR2: 0000000001903430 CR3: 0000000146e55000 CR4: 00000000001407e0
[  191.200245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  191.200332] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  191.200427] Process flip_test (pid: 3366, threadinfo ffff880146e56000, task ffff88013233d320)
[  191.200535] Stack:
[  191.200564]  0000000000000001 ffff880132020348 0000000000000001 ffff880132020348
[  191.200692]  ffff880134025f18 0000000000000000 ffff880146e57c38 ffffffffa009e4ca
[  191.200820]  0000000000000000 0000000000000246 ffffffffa00aed18 ffff880132020348
[  191.200954] Call Trace:
[  191.200977]  [<ffffffffa009e4ca>] i915_gem_object_pin_to_display_plane+0x6d/0xd0 [i915]
[  191.201054]  [<ffffffffa00aed18>] ? intel_crtc_page_flip+0x150/0x2e3 [i915]
[  191.201116]  [<ffffffffa00abb25>] intel_pin_and_fence_fb_obj+0x83/0xe3 [i915]
[  191.201234]  [<ffffffffa00abbf0>] intel_gen7_queue_flip+0x31/0xb9 [i915]
[  191.201349]  [<ffffffffa00aed72>] intel_crtc_page_flip+0x1aa/0x2e3 [i915]
[  191.201462]  [<ffffffffa00495e0>] drm_mode_page_flip_ioctl+0x1ea/0x241 [drm]
[  191.201568]  [<ffffffffa003b7e4>] drm_ioctl+0x2d8/0x3a4 [drm]
[  191.201665]  [<ffffffffa00493f6>] ? drm_mode_gamma_get_ioctl+0x10c/0x10c [drm]
[  191.201770]  [<ffffffff81073420>] ? trace_hardirqs_on_caller+0x121/0x158
[  191.201863]  [<ffffffff81073464>] ? trace_hardirqs_on+0xd/0xf
[  191.201929]  [<ffffffff8110acb4>] do_vfs_ioctl+0x464/0x4a5
[  191.201966]  [<ffffffff810fcfc2>] ? fcheck_files+0xa9/0xe7
[  191.201998]  [<ffffffff8110ad46>] sys_ioctl+0x51/0x77
[  191.202031]  [<ffffffff813ed222>] system_call_fastpath+0x16/0x1b
[  191.202066] Code: 5f 5d c3 55 48 89 e5 41 55 41 89 f5 41 54 41 88 d4 53 48 89 fb 48 83 ec 18 8b 87 f0 00 00 00 25 00 e0 01 00 3d 00 e0 01 00 75 02 <0f> 0b 48 83 bf 98 00 00 00 00 74 67 85 f6 74 0c 41 8d 55 ff 85 
[  191.205855] RIP  [<ffffffffa009e35c>] i915_gem_object_pin+0x28/0x129 [i915]
[  191.209246]  RSP <ffff880146e57bc8>
[  191.212607] ---[ end trace 580741cea45733f6 ]---
Comment 1 Daniel Vetter 2012-05-19 13:10:51 UTC
Yet another pin leak.
Comment 2 Daniel Vetter 2012-05-19 13:12:08 UTC
Quick question: Does this happen always when you run the test (i.e. also when you run this test manually on an otherwise idle machine)?
Comment 3 Guang Yang 2012-05-20 18:38:20 UTC
(In reply to comment #2)
> Quick question: Does this happen always when you run the test (i.e. also when
> you run this test manually on an otherwise idle machine)? 
 Yes,it will happen always when running this case manually on an otherwise idle machine.
Comment 4 Chris Wilson 2012-05-21 03:46:22 UTC
What is the output configuration? (./testdisplay -i will suffice).
Comment 5 Guang Yang 2012-05-21 20:26:54 UTC
(In reply to comment #4)
> What is the output configuration? (./testdisplay -i will suffice).
Here is the output configuration:
Connectors:
id	encoder	status		type	size (mm)	modes
9	10	connected	Embedded DisplayPort	340x190		2
  modes:
  name refresh (Hz) hdisp hss hse htot vdisp vss vse vtot flags type clock
  1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000
  1366x768 40 1366 1398 1422 1426 768 771 775 806 0x9 0x40 46000
13	0	disconnected	VGA	0x0		0
16	0	disconnected	HDMI-A	0x0		0
17	0	disconnected	DisplayPort	0x0		0
20	0	disconnected	HDMI-A	0x0		0
22	0	disconnected	HDMI-A	0x0		0
23	0	disconnected	DisplayPort	0x0		0
25	0	disconnected	DisplayPort	0x0		0

CRTCs:
id	fb	pos	size
3	32	(0,0)	(0x0)
  1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000
5	0	(0,0)	(0x0)
   0 0 0 0 0 0 0 0 0 0x0 0x0 0
7	0	(0,0)	(0x0)
   0 0 0 0 0 0 0 0 0 0x0 0x0 0
Comment 6 Chris Wilson 2012-05-22 07:22:52 UTC
Ok, I was expecting a more complex configuration. Can't spot a reason why it should work for me and not for you. So can you attach a full drm.debug dmesg (better make it drm.debug=7) and also trace-cmd record -e drm -e i915 ./flip-test
Comment 7 Guang Yang 2012-05-22 20:24:55 UTC
Created attachment 61992 [details]
runnint case with drm.debug=7 's debug info
Comment 8 Guang Yang 2012-05-22 21:39:07 UTC
Created attachment 61996 [details]
trace-cmd report info

(In reply to comment #6)
> Ok, I was expecting a more complex configuration. Can't spot a reason why it
> should work for me and not for you. So can you attach a full drm.debug dmesg
> (better make it drm.debug=7) and also trace-cmd record -e drm -e i915
> ./flip-test
I try with trace-cmd and it shows below,then I attach the trace report:
/sys/kernel/debug/tracing/events/i915/filter
/sys/kernel/debug/tracing/events/*/i915/filter
/sys/kernel/debug/tracing/events/drm/filter
/sys/kernel/debug/tracing/events/*/drm/filter
  1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000
page flipping on crtc 3, connector 9: PASSED
  1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000
^C^C^C^C^C^C^C^C
Kernel buffer statistics:
  Note: "entries" are the entries left in the kernel ring buffer and are not
        recorded in the trace data. They should all be zero.

CPU: 0
entries: 161
overrun: 0
commit overrun: 0
bytes: 6792
oldest event ts:   511.323942
now ts:   512.218856

CPU: 1
entries: 0
overrun: 0
commit overrun: 0
bytes: 2176
oldest event ts:   496.830697
now ts:   512.219338

CPU: 2
entries: 1001
overrun: 0
commit overrun: 0
bytes: 36336
oldest event ts:   510.037205
now ts:   512.219820

CPU: 3
entries: 0
overrun: 0
commit overrun: 0
bytes: 776
oldest event ts:   488.917924
now ts:   512.220280

CPU: 4
entries: 314
overrun: 0
commit overrun: 0
bytes: 13572
oldest event ts:   511.952632
now ts:   512.220796

CPU: 5
entries: 0
overrun: 0
commit overrun: 0
bytes: 940
oldest event ts:   493.860700
now ts:   512.221259

CPU: 6
entries: 348
overrun: 0
commit overrun: 0
bytes: 15012
oldest event ts:   511.848710
now ts:   512.221734

CPU: 7
entries: 0
overrun: 0
commit overrun: 0
bytes: 1144
oldest event ts:   489.550560
now ts:   512.222190

^CCPU0 data recorded at offset=0x27c000
    720896 bytes in size
CPU1 data recorded at offset=0x32c000
    57344 bytes in size
CPU2 data recorded at offset=0x33a000
    192512 bytes in size
CPU3 data recorded at offset=0x369000
    4096 bytes in size
CPU4 data recorded at offset=0x36a000
    40960 bytes in size
CPU5 data recorded at offset=0x374000
    4096 bytes in size
CPU6 data recorded at offset=0x375000
    69632 bytes in size
CPU7 data recorded at offset=0x386000
    4096 bytes in size
Comment 9 Chris Wilson 2012-05-23 03:07:47 UTC
So judging by the traces we submit a large number of pageflips but only process one request (which never completes). Yet for the pin-count to be incremented means that we pass the test for incomplete pageflip and since we do not emit a tracepoint we must fail between pinning the obj and emitting the request (which also presumes that a previously unnoticed failure occurs for the outstanding work.) This is many impossible steps. :|
Comment 10 Chris Wilson 2012-05-23 03:08:38 UTC
Created attachment 62008 [details] [review]
Make the lock for pageflips interruptible
Comment 11 Daniel Vetter 2012-05-23 05:16:34 UTC
Created attachment 62015 [details] [review]
fixup ivb plane 3 pageflips

Please try this patch on top of Chris' patch.
Comment 12 Guang Yang 2012-05-24 01:04:29 UTC
I try the patch of chris on comment 10,the flip_test still can't finish,and the dmesg show the same call trace as comment 0.
when I try Daniel's patch on top of Chris' patch on comment 11,the kernel with these two patches will kernel panic after grub selecting.
Comment 13 Daniel Vetter 2012-05-24 01:43:53 UTC
Hm, if this machine dies right after boot, then something very fishy is going on - this patch only affects the pageflip code, so that should only get run when a 3d compositor (or 3d app) finally starts.

Can you please double check whether this is not a hw problem (memtest&friends) and maybe compile the kernel with a different gcc to rule out compiler issues?
Comment 14 Guang Yang 2012-05-24 23:11:05 UTC
(In reply to comment #13)
> Hm, if this machine dies right after boot, then something very fishy is going
> on - this patch only affects the pageflip code, so that should only get run
> when a 3d compositor (or 3d app) finally starts.
> Can you please double check whether this is not a hw problem (memtest&friends)
> and maybe compile the kernel with a different gcc to rule out compiler issues?

I try to rebuild the kernel with two patches again, it can correctly work this time, but the case still can't finish and have the same call trace info as comment0
Comment 15 Daniel Vetter 2012-06-06 06:29:34 UTC
Fix merged to -fixes, confirmed that it works by Eugeni.

commit cb05d8dedefa3066bf5d74ef88c6ca6cf4bd1c87
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed May 23 14:02:00 2012 +0200

    drm/i915: fix up ivb plane 3 pageflips

I have no idea what's going on with your ivb system, but I suspect some hw flakiness. If that is ruled out (because you can reproduce this on other ivb machines, too), please reopen this bug.
Comment 16 Yi Sun 2012-06-06 23:44:57 UTC
Created attachment 62690 [details]
dmesg file when running the flip_test

(In reply to comment #15)
> Fix merged to -fixes, confirmed that it works by Eugeni.
> 
> commit cb05d8dedefa3066bf5d74ef88c6ca6cf4bd1c87
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Wed May 23 14:02:00 2012 +0200
> 
>     drm/i915: fix up ivb plane 3 pageflips
> 
> I have no idea what's going on with your ivb system, but I suspect some hw
> flakiness. If that is ruled out (because you can reproduce this on other ivb
> machines, too), please reopen this bug.

We can reproduce this issue on our two IVB machines stably. 
The output is:

[root@x-ivb12 tests]# ./flip_test
  1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000
page flipping on crtc 3, connector 9: PASSED
  1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000
page flipping on crtc 5, connector 9: PASSED
  1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000

The dmesg is attached.

When this issue happens, the system hang. Even can't be rebooted, we have to power off psychically.
Comment 17 Daniel Vetter 2012-06-14 10:53:45 UTC
Can you plese run

perf record -e i915:* tests/flip_test

and then attach the output of

perf script

(you might need to compress this).

Judging from dmesg we have a runaway interrupt or something equally bad going on. And our code sucks because it doesn't rate-limit our pageflipping to the speed of our unpinning.
Comment 18 Guang Yang 2012-06-14 20:31:12 UTC
Created attachment 63054 [details]
zip format of perf.data

(In reply to comment #17)
> Can you plese run
> 
> perf record -e i915:* tests/flip_test
> 
> and then attach the output of
> 
> perf script
> 
> (you might need to compress this).
> 
> Judging from dmesg we have a runaway interrupt or something equally bad going
> on. And our code sucks because it doesn't rate-limit our pageflipping to the
> speed of our unpinning.
I attach the zip format of pref.data which getting while running that case.
Comment 19 Daniel Vetter 2012-06-15 02:32:26 UTC
You need to run perf script to decode the perf.data file - it only contains binary data which is only valid for exactly the kernel you've gathered it's output on (aside: this is done for performance reason so that gathering these traces taxes the system as little as possible). To avoid decoding the perf.data on the wrong kernel I think it'd be safest if you re-run the perf record command, too.
Comment 20 Guang Yang 2012-06-17 18:45:26 UTC
(In reply to comment #19)
> You need to run perf script to decode the perf.data file - it only contains
> binary data which is only valid for exactly the kernel you've gathered it's
> output on (aside: this is done for performance reason so that gathering these
> traces taxes the system as little as possible). To avoid decoding the perf.data
> on the wrong kernel I think it'd be safest if you re-run the perf record
> command, too.
Dianel, I'm comfused ,do you mean I need to run the  perf record command twice to get the right perf.data?
Comment 21 Daniel Vetter 2012-06-18 00:45:40 UTC
(In reply to comment #20)
> Dianel, I'm comfused ,do you mean I need to run the  perf record command twice
> to get the right perf.data?

The problem is that perf.data is kernel-specific, so I can't use that. You need to decode it with

$ perf script > perf.decoded

This will automatically pick perf.data from the current directory. But I guess that the machine now runs a different nightly kernel build, so hence you first need to record a new perf.data trace with

$ perf record -e i915:* tests/flip_test

because decoding the old perf.data on a new kernel can lead to a bogus decode.
Comment 22 Guang Yang 2012-06-18 02:08:48 UTC
Created attachment 63163 [details]
running case's decoded

(In reply to comment #21)
> The problem is that perf.data is kernel-specific, so I can't use that. You need
> to decode it with
> 
> $ perf script > perf.decoded
> 
> This will automatically pick perf.data from the current directory. But I guess
> that the machine now runs a different nightly kernel build, so hence you first
> need to record a new perf.data trace with
> 
> $ perf record -e i915:* tests/flip_test
> 
> because decoding the old perf.data on a new kernel can lead to a bogus decode.

Here is the perf.decoded getting from running that case.
Comment 23 Florian Mickler 2012-08-05 11:24:04 UTC
A patch referencing this bug report has been merged in Linux v3.6-rc1:

commit 79158103bf20c9ff55204ceb78a5773f422b03be
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed May 23 11:13:58 2012 +0100

    drm/i915: Make the lock for pageflips interruptible
Comment 24 Daniel Vetter 2012-08-22 10:25:37 UTC
This is funny, we pageflip _way_ too fast afaict ...
Comment 25 Chris Wilson 2012-08-22 10:36:19 UTC
No, that's just the i-g-t not reporting the error return and trying again in spite of the error.
Comment 26 Chris Wilson 2012-09-28 12:46:47 UTC
i-g-t/flip-test updated and hopefully hardened against its own mistakes.
Comment 27 Guang Yang 2012-10-15 03:37:45 UTC
The issue is gone with newest -testing