Created attachment 61771 [details] after running flip_test dmesg info System Environment: -------------------------- Platform: Ivbridge(chief river) Kernel: (drm-intel-next-queued)659900dd3fa07b4ddffebfff07f6ee0341988943 Bug detailed description: ------------------------- On IVB platform ,running flip_test of the Intel-gpu-tools will never finish. On the console shows this: 1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000 page flipping on crtc 3, connector 9: PASSED 1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000 then it stays there. In dmesg shows: [ 191.198279] ------------[ cut here ]------------ [ 191.198316] kernel BUG at drivers/gpu/drm/i915/i915_gem.c:3019! [ 191.198357] invalid opcode: 0000 [#1] SMP [ 191.198393] CPU 1 [ 191.198409] Modules linked in: sunrpc cpufreq_ondemand acpi_cpufreq freq_table mperf dm_mirror dm_region_hash dm_log asix usbnet snd_hda_codec_idt mii snd_hda_intel snd_hda_codec snd_hwdep snd_seq coretemp hwmon snd_seq_device microcode sg snd_pcm snd_timer snd pcspkr i2c_i801 e1000e soundcore snd_page_alloc iTCO_wdt iTCO_vendor_support battery ac ipv6 autofs4 ext3 jbd mbcache sr_mod cdrom sd_mod ehci_hcd i915 drm_kms_helper drm i2c_algo_bit button i2c_core video dm_mod [last unloaded: scsi_wait_scan] [ 191.199042] [ 191.199076] Pid: 3366, comm: flip_test Not tainted 3.4.0-rc6_unstable_20120517+ #1 Intel Corporation 2012 Client Platform/Yellow Bluff FFVS [ 191.199256] RIP: 0010:[<ffffffffa009e35c>] [<ffffffffa009e35c>] i915_gem_object_pin+0x28/0x129 [i915] [ 191.199413] RSP: 0018:ffff880146e57bc8 EFLAGS: 00010246 [ 191.199480] RAX: 000000000001e000 RBX: ffff880132020348 RCX: ffff880132020348 [ 191.199583] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff880132020348 [ 191.199669] RBP: ffff880146e57bf8 R08: 0000000000000001 R09: ffff880134024000 [ 191.199766] R10: ffff88014e22b000 R11: ffff88014682b738 R12: ffff880134025f01 [ 191.199861] R13: 0000000000000000 R14: ffff880132020348 R15: ffff880132020348 [ 191.199949] FS: 00007f891689b8a0(0000) GS:ffff88014e220000(0000) knlGS:0000000000000000 [ 191.200066] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [ 191.200137] CR2: 0000000001903430 CR3: 0000000146e55000 CR4: 00000000001407e0 [ 191.200245] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 191.200332] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [ 191.200427] Process flip_test (pid: 3366, threadinfo ffff880146e56000, task ffff88013233d320) [ 191.200535] Stack: [ 191.200564] 0000000000000001 ffff880132020348 0000000000000001 ffff880132020348 [ 191.200692] ffff880134025f18 0000000000000000 ffff880146e57c38 ffffffffa009e4ca [ 191.200820] 0000000000000000 0000000000000246 ffffffffa00aed18 ffff880132020348 [ 191.200954] Call Trace: [ 191.200977] [<ffffffffa009e4ca>] i915_gem_object_pin_to_display_plane+0x6d/0xd0 [i915] [ 191.201054] [<ffffffffa00aed18>] ? intel_crtc_page_flip+0x150/0x2e3 [i915] [ 191.201116] [<ffffffffa00abb25>] intel_pin_and_fence_fb_obj+0x83/0xe3 [i915] [ 191.201234] [<ffffffffa00abbf0>] intel_gen7_queue_flip+0x31/0xb9 [i915] [ 191.201349] [<ffffffffa00aed72>] intel_crtc_page_flip+0x1aa/0x2e3 [i915] [ 191.201462] [<ffffffffa00495e0>] drm_mode_page_flip_ioctl+0x1ea/0x241 [drm] [ 191.201568] [<ffffffffa003b7e4>] drm_ioctl+0x2d8/0x3a4 [drm] [ 191.201665] [<ffffffffa00493f6>] ? drm_mode_gamma_get_ioctl+0x10c/0x10c [drm] [ 191.201770] [<ffffffff81073420>] ? trace_hardirqs_on_caller+0x121/0x158 [ 191.201863] [<ffffffff81073464>] ? trace_hardirqs_on+0xd/0xf [ 191.201929] [<ffffffff8110acb4>] do_vfs_ioctl+0x464/0x4a5 [ 191.201966] [<ffffffff810fcfc2>] ? fcheck_files+0xa9/0xe7 [ 191.201998] [<ffffffff8110ad46>] sys_ioctl+0x51/0x77 [ 191.202031] [<ffffffff813ed222>] system_call_fastpath+0x16/0x1b [ 191.202066] Code: 5f 5d c3 55 48 89 e5 41 55 41 89 f5 41 54 41 88 d4 53 48 89 fb 48 83 ec 18 8b 87 f0 00 00 00 25 00 e0 01 00 3d 00 e0 01 00 75 02 <0f> 0b 48 83 bf 98 00 00 00 00 74 67 85 f6 74 0c 41 8d 55 ff 85 [ 191.205855] RIP [<ffffffffa009e35c>] i915_gem_object_pin+0x28/0x129 [i915] [ 191.209246] RSP <ffff880146e57bc8> [ 191.212607] ---[ end trace 580741cea45733f6 ]---
Yet another pin leak.
Quick question: Does this happen always when you run the test (i.e. also when you run this test manually on an otherwise idle machine)?
(In reply to comment #2) > Quick question: Does this happen always when you run the test (i.e. also when > you run this test manually on an otherwise idle machine)? Yes,it will happen always when running this case manually on an otherwise idle machine.
What is the output configuration? (./testdisplay -i will suffice).
(In reply to comment #4) > What is the output configuration? (./testdisplay -i will suffice). Here is the output configuration: Connectors: id encoder status type size (mm) modes 9 10 connected Embedded DisplayPort 340x190 2 modes: name refresh (Hz) hdisp hss hse htot vdisp vss vse vtot flags type clock 1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000 1366x768 40 1366 1398 1422 1426 768 771 775 806 0x9 0x40 46000 13 0 disconnected VGA 0x0 0 16 0 disconnected HDMI-A 0x0 0 17 0 disconnected DisplayPort 0x0 0 20 0 disconnected HDMI-A 0x0 0 22 0 disconnected HDMI-A 0x0 0 23 0 disconnected DisplayPort 0x0 0 25 0 disconnected DisplayPort 0x0 0 CRTCs: id fb pos size 3 32 (0,0) (0x0) 1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000 5 0 (0,0) (0x0) 0 0 0 0 0 0 0 0 0 0x0 0x0 0 7 0 (0,0) (0x0) 0 0 0 0 0 0 0 0 0 0x0 0x0 0
Ok, I was expecting a more complex configuration. Can't spot a reason why it should work for me and not for you. So can you attach a full drm.debug dmesg (better make it drm.debug=7) and also trace-cmd record -e drm -e i915 ./flip-test
Created attachment 61992 [details] runnint case with drm.debug=7 's debug info
Created attachment 61996 [details] trace-cmd report info (In reply to comment #6) > Ok, I was expecting a more complex configuration. Can't spot a reason why it > should work for me and not for you. So can you attach a full drm.debug dmesg > (better make it drm.debug=7) and also trace-cmd record -e drm -e i915 > ./flip-test I try with trace-cmd and it shows below,then I attach the trace report: /sys/kernel/debug/tracing/events/i915/filter /sys/kernel/debug/tracing/events/*/i915/filter /sys/kernel/debug/tracing/events/drm/filter /sys/kernel/debug/tracing/events/*/drm/filter 1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000 page flipping on crtc 3, connector 9: PASSED 1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000 ^C^C^C^C^C^C^C^C Kernel buffer statistics: Note: "entries" are the entries left in the kernel ring buffer and are not recorded in the trace data. They should all be zero. CPU: 0 entries: 161 overrun: 0 commit overrun: 0 bytes: 6792 oldest event ts: 511.323942 now ts: 512.218856 CPU: 1 entries: 0 overrun: 0 commit overrun: 0 bytes: 2176 oldest event ts: 496.830697 now ts: 512.219338 CPU: 2 entries: 1001 overrun: 0 commit overrun: 0 bytes: 36336 oldest event ts: 510.037205 now ts: 512.219820 CPU: 3 entries: 0 overrun: 0 commit overrun: 0 bytes: 776 oldest event ts: 488.917924 now ts: 512.220280 CPU: 4 entries: 314 overrun: 0 commit overrun: 0 bytes: 13572 oldest event ts: 511.952632 now ts: 512.220796 CPU: 5 entries: 0 overrun: 0 commit overrun: 0 bytes: 940 oldest event ts: 493.860700 now ts: 512.221259 CPU: 6 entries: 348 overrun: 0 commit overrun: 0 bytes: 15012 oldest event ts: 511.848710 now ts: 512.221734 CPU: 7 entries: 0 overrun: 0 commit overrun: 0 bytes: 1144 oldest event ts: 489.550560 now ts: 512.222190 ^CCPU0 data recorded at offset=0x27c000 720896 bytes in size CPU1 data recorded at offset=0x32c000 57344 bytes in size CPU2 data recorded at offset=0x33a000 192512 bytes in size CPU3 data recorded at offset=0x369000 4096 bytes in size CPU4 data recorded at offset=0x36a000 40960 bytes in size CPU5 data recorded at offset=0x374000 4096 bytes in size CPU6 data recorded at offset=0x375000 69632 bytes in size CPU7 data recorded at offset=0x386000 4096 bytes in size
So judging by the traces we submit a large number of pageflips but only process one request (which never completes). Yet for the pin-count to be incremented means that we pass the test for incomplete pageflip and since we do not emit a tracepoint we must fail between pinning the obj and emitting the request (which also presumes that a previously unnoticed failure occurs for the outstanding work.) This is many impossible steps. :|
Created attachment 62008 [details] [review] Make the lock for pageflips interruptible
Created attachment 62015 [details] [review] fixup ivb plane 3 pageflips Please try this patch on top of Chris' patch.
I try the patch of chris on comment 10,the flip_test still can't finish,and the dmesg show the same call trace as comment 0. when I try Daniel's patch on top of Chris' patch on comment 11,the kernel with these two patches will kernel panic after grub selecting.
Hm, if this machine dies right after boot, then something very fishy is going on - this patch only affects the pageflip code, so that should only get run when a 3d compositor (or 3d app) finally starts. Can you please double check whether this is not a hw problem (memtest&friends) and maybe compile the kernel with a different gcc to rule out compiler issues?
(In reply to comment #13) > Hm, if this machine dies right after boot, then something very fishy is going > on - this patch only affects the pageflip code, so that should only get run > when a 3d compositor (or 3d app) finally starts. > Can you please double check whether this is not a hw problem (memtest&friends) > and maybe compile the kernel with a different gcc to rule out compiler issues? I try to rebuild the kernel with two patches again, it can correctly work this time, but the case still can't finish and have the same call trace info as comment0
Fix merged to -fixes, confirmed that it works by Eugeni. commit cb05d8dedefa3066bf5d74ef88c6ca6cf4bd1c87 Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Wed May 23 14:02:00 2012 +0200 drm/i915: fix up ivb plane 3 pageflips I have no idea what's going on with your ivb system, but I suspect some hw flakiness. If that is ruled out (because you can reproduce this on other ivb machines, too), please reopen this bug.
Created attachment 62690 [details] dmesg file when running the flip_test (In reply to comment #15) > Fix merged to -fixes, confirmed that it works by Eugeni. > > commit cb05d8dedefa3066bf5d74ef88c6ca6cf4bd1c87 > Author: Daniel Vetter <daniel.vetter@ffwll.ch> > Date: Wed May 23 14:02:00 2012 +0200 > > drm/i915: fix up ivb plane 3 pageflips > > I have no idea what's going on with your ivb system, but I suspect some hw > flakiness. If that is ruled out (because you can reproduce this on other ivb > machines, too), please reopen this bug. We can reproduce this issue on our two IVB machines stably. The output is: [root@x-ivb12 tests]# ./flip_test 1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000 page flipping on crtc 3, connector 9: PASSED 1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000 page flipping on crtc 5, connector 9: PASSED 1366x768 60 1366 1398 1422 1426 768 771 775 806 0x9 0x48 69000 The dmesg is attached. When this issue happens, the system hang. Even can't be rebooted, we have to power off psychically.
Can you plese run perf record -e i915:* tests/flip_test and then attach the output of perf script (you might need to compress this). Judging from dmesg we have a runaway interrupt or something equally bad going on. And our code sucks because it doesn't rate-limit our pageflipping to the speed of our unpinning.
Created attachment 63054 [details] zip format of perf.data (In reply to comment #17) > Can you plese run > > perf record -e i915:* tests/flip_test > > and then attach the output of > > perf script > > (you might need to compress this). > > Judging from dmesg we have a runaway interrupt or something equally bad going > on. And our code sucks because it doesn't rate-limit our pageflipping to the > speed of our unpinning. I attach the zip format of pref.data which getting while running that case.
You need to run perf script to decode the perf.data file - it only contains binary data which is only valid for exactly the kernel you've gathered it's output on (aside: this is done for performance reason so that gathering these traces taxes the system as little as possible). To avoid decoding the perf.data on the wrong kernel I think it'd be safest if you re-run the perf record command, too.
(In reply to comment #19) > You need to run perf script to decode the perf.data file - it only contains > binary data which is only valid for exactly the kernel you've gathered it's > output on (aside: this is done for performance reason so that gathering these > traces taxes the system as little as possible). To avoid decoding the perf.data > on the wrong kernel I think it'd be safest if you re-run the perf record > command, too. Dianel, I'm comfused ,do you mean I need to run the perf record command twice to get the right perf.data?
(In reply to comment #20) > Dianel, I'm comfused ,do you mean I need to run the perf record command twice > to get the right perf.data? The problem is that perf.data is kernel-specific, so I can't use that. You need to decode it with $ perf script > perf.decoded This will automatically pick perf.data from the current directory. But I guess that the machine now runs a different nightly kernel build, so hence you first need to record a new perf.data trace with $ perf record -e i915:* tests/flip_test because decoding the old perf.data on a new kernel can lead to a bogus decode.
Created attachment 63163 [details] running case's decoded (In reply to comment #21) > The problem is that perf.data is kernel-specific, so I can't use that. You need > to decode it with > > $ perf script > perf.decoded > > This will automatically pick perf.data from the current directory. But I guess > that the machine now runs a different nightly kernel build, so hence you first > need to record a new perf.data trace with > > $ perf record -e i915:* tests/flip_test > > because decoding the old perf.data on a new kernel can lead to a bogus decode. Here is the perf.decoded getting from running that case.
A patch referencing this bug report has been merged in Linux v3.6-rc1: commit 79158103bf20c9ff55204ceb78a5773f422b03be Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Wed May 23 11:13:58 2012 +0100 drm/i915: Make the lock for pageflips interruptible
This is funny, we pageflip _way_ too fast afaict ...
No, that's just the i-g-t not reporting the error return and trying again in spite of the error.
i-g-t/flip-test updated and hopefully hardened against its own mistakes.
The issue is gone with newest -testing
Closing old verified.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.