Summary: | [BAT] [HSW] missed breadcrumbs [igt@gem_sync@basic-[all/each] and others fail] | ||||||||
---|---|---|---|---|---|---|---|---|---|
Product: | DRI | Reporter: | Dorota Czaplejewicz <dorota.czaplejewicz> | ||||||
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||||
Status: | CLOSED WORKSFORME | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> | ||||||
Severity: | major | ||||||||
Priority: | high | CC: | daniel, dorota.czaplejewicz, intel-gfx-bugs, krisman | ||||||
Version: | unspecified | ||||||||
Hardware: | Other | ||||||||
OS: | All | ||||||||
Whiteboard: | |||||||||
i915 platform: | HSW, IVB | i915 features: | GEM/execlists | ||||||
Attachments: |
|
commit 24754d751cb86f6069315d8d613e23afcab06c91 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri Mar 3 14:45:57 2017 +0000 drm/i915: Take reference for signaling the request from hardirq Being inside a spinlock signaling that the hardware just completed a request doesn't prevent a second thread already spotting that the request is complete, freeing it and reallocating it! The code currently tries to prevent this using RCU -- but that only prevents the request from being freed, it doesn't prevent us from reallocating it - that requires us to take a reference. [ 206.922985] BUG: spinlock already unlocked on CPU#4, gem_exec_parall/7796 [ 206.922994] lock: 0xffff8801c6047120, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1 [ 206.923000] CPU: 4 PID: 7796 Comm: gem_exec_parall Not tainted 4.10.0-CI-Patchwork_4008+ #1 [ 206.923006] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 1805 06/20/2016 [ 206.923012] Call Trace: [ 206.923014] <IRQ> [ 206.923019] dump_stack+0x67/0x92 [ 206.923023] spin_dump+0x73/0xc0 [ 206.923027] do_raw_spin_unlock+0x79/0xb0 [ 206.923031] _raw_spin_unlock_irqrestore+0x27/0x60 [ 206.923042] dma_fence_signal+0x160/0x230 [ 206.923060] notify_ring+0xae/0x2e0 [i915] [ 206.923073] ? ibx_hpd_irq_handler+0xc0/0xc0 [i915] [ 206.923086] gen8_gt_irq_handler+0x219/0x290 [i915] [ 206.923100] gen8_irq_handler+0x8e/0x6b0 [i915] [ 206.923105] __handle_irq_event_percpu+0x58/0x370 [ 206.923109] handle_irq_event_percpu+0x1e/0x50 [ 206.923113] handle_irq_event+0x34/0x60 [ 206.923117] handle_edge_irq+0xbe/0x150 [ 206.923122] handle_irq+0x15/0x20 [ 206.923126] do_IRQ+0x63/0x130 [ 206.923142] ? i915_mutex_lock_interruptible+0x39/0x140 [i915] [ 206.923148] common_interrupt+0x90/0x90 [ 206.923153] RIP: 0010:osq_lock+0x77/0x110 [ 206.923157] RSP: 0018:ffffc90001cabaa0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff6e [ 206.923164] RAX: 0000000000000000 RBX: ffff880236d1abc0 RCX: ffff8801ef642fc0 [ 206.923169] RDX: ffff8801ef6427c0 RSI: ffffffff81c6e7fd RDI: ffffffff81c7c848 [ 206.923175] RBP: ffffc90001cabab8 R08: 00000000692bb19b R09: 08c1493200000000 [ 206.923180] R10: 0000000000000001 R11: 0000000000000001 R12: ffff880236cdabc0 [ 206.923185] R13: ffff8802207f00b0 R14: ffffffffa00b7cd9 R15: ffff8802207f0070 [ 206.923191] </IRQ> [ 206.923206] ? i915_mutex_lock_interruptible+0x39/0x140 [i915] [ 206.923213] __mutex_lock+0x649/0x990 [ 206.923217] ? __mutex_lock+0xb0/0x990 [ 206.923221] ? _raw_spin_unlock+0x2c/0x50 [ 206.923226] ? __pm_runtime_resume+0x56/0x80 [ 206.923242] ? i915_mutex_lock_interruptible+0x39/0x140 [i915] [ 206.923249] mutex_lock_interruptible_nested+0x16/0x20 [ 206.923264] i915_mutex_lock_interruptible+0x39/0x140 [i915] [ 206.923270] ? __pm_runtime_resume+0x56/0x80 [ 206.923285] i915_gem_do_execbuffer.isra.15+0x442/0x1d10 [i915] [ 206.923291] ? __lock_acquire+0x449/0x1b50 [ 206.923296] ? __might_fault+0x3e/0x90 [ 206.923301] ? __might_fault+0x87/0x90 [ 206.923305] ? __might_fault+0x3e/0x90 [ 206.923320] i915_gem_execbuffer2+0xb5/0x220 [i915] [ 206.923327] drm_ioctl+0x200/0x450 [ 206.923341] ? i915_gem_execbuffer+0x330/0x330 [i915] [ 206.923348] do_vfs_ioctl+0x90/0x6e0 [ 206.923352] ? __fget+0x108/0x200 [ 206.923356] ? expand_files+0x2b0/0x2b0 [ 206.923361] SyS_ioctl+0x3c/0x70 [ 206.923365] entry_SYSCALL_64_fastpath+0x1c/0xb1 [ 206.923369] RIP: 0033:0x7fdd75fc6357 [ 206.923373] RSP: 002b:00007fdd20e59bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [ 206.923380] RAX: ffffffffffffffda RBX: ffffffff81481ff3 RCX: 00007fdd75fc6357 [ 206.923385] RDX: 00007fdd20e59c70 RSI: 0000000040406469 RDI: 0000000000000003 [ 206.923390] RBP: ffffc90001cabf88 R08: 0000000000000040 R09: 00000000000003f7 [ 206.923396] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 206.923401] R13: 0000000000000003 R14: 0000000040406469 R15: 0000000001cf9cb0 [ 206.923408] ? __this_cpu_preempt_check+0x13/0x20 Fixes: 56299fb7d904 ("drm/i915: Signal first fence from irq handler if complete") Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100051 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/2017030 Opened the wrong gem_exec_parallel bug. This could just be the other breadcrumbs bug, so let's wait and see. Bisected: commit a402eae64d0ad12b1c4a411f250d6c161e67f623 Merge: 7800fb6 0c744ea Author: Daniel Vetter <daniel.vetter@ffwll.ch> Date: Wed Jan 4 11:34:01 2017 +0100 Merge tag 'v4.10-rc2' into drm-intel-next-queued Backmerge Linux 4.10-rc2 to resync with our -fixes cherry-picks. I've done the backmerge directly because Dave is on vacation. Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Note about the bisection: I tested both parent commits, and they were both good. (In reply to Dorota Czaplejewicz from comment #3) > Bisected: > > commit a402eae64d0ad12b1c4a411f250d6c161e67f623 > Merge: 7800fb6 0c744ea > Author: Daniel Vetter <daniel.vetter@ffwll.ch> > Date: Wed Jan 4 11:34:01 2017 +0100 > > Merge tag 'v4.10-rc2' into drm-intel-next-queued > > Backmerge Linux 4.10-rc2 to resync with our -fixes cherry-picks. I've > done the backmerge directly because Dave is on vacation. > > Signed-off-by: Daniel Vetter <daniel.vetter@intel.com> Ugh. Cc: Daniel. Nothing in the merge commit, so likely an interaction between too patches. The way to figure this out is probably with bisecting which additional patch breaks stuff: $ git bisect start $ git bisect good $first_parent $ git bisect bad $merge_commit Then create a separate git repo, and for every sha1 that bisect wants you to test do: $ git reset --hard $first_parent $ git merge $sha1_to_test Then run your testcase, go back to the other git checkout where you do the bisecting and report the result. That should give you you an offending commit somewhere in the branch for the 2nd parent. If you then do it the other way round you will find the offending patch in the 1st parent. Those two should then be the 2 patches which interact in a bad way somehow, and probably explain the bug. A more effective way might be to do a full git imerge on both branches, and run the bisect over that. But that's even more fancy ... The bisect is bogus. It is just the expected seqno incoherency that we all know and love. Kernel https://intel-gfx-ci.01.org/CI/CI_DRM_2320/kernel.tar.gz on the same hardware gave 31/32 failures. also seen in igt@gem_sync@basic-each $ sudo ./gem_sync --r basic-each IGT-Version: 1.17-ge8eb9af (x86_64) (Linux: 4.11.0-rc1-CI-CI_DRM_2320+ x86_64) Using Legacy submission , with semaphores bsd completed 53248 cycles: 94.688 us vebox completed 53248 cycles: 94.695 us render completed 53248 cycles: 94.938 us blt completed 32768 cycles: 168.624 us (gem_sync:23461) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:153: (gem_sync:23461) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 (gem_sync:23461) CRITICAL: error: 2 != 0 Stack trace: #0 [__igt_fail_assert+0xf1] #1 [sync_ring+0x2cc] #2 [__real_main722+0x1d7] #3 [main+0x33] #4 [__libc_start_main+0xf1] #5 [_start+0x29] #6 [<unknown>+0x29] Subtest basic-each failed. **** DEBUG **** (gem_sync:23461) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:153: (gem_sync:23461) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 (gem_sync:23461) CRITICAL: error: 2 != 0 **** END **** Subtest basic-each: FAIL (5.575s) Also igt@gem_sync@basic-all: $ sudo ./gem_sync --r basic-all IGT-Version: 1.17-ge8eb9af (x86_64) (Linux: 4.11.0-rc1-CI-CI_DRM_2320+ x86_64) Using Legacy submission , with semaphores Completed 21504 cycles: 235.799 us (gem_sync:32030) CRITICAL: Test assertion failure function sync_all, file gem_sync.c:570: (gem_sync:32030) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 (gem_sync:32030) CRITICAL: error: 2 != 0 Stack trace: #0 [__igt_fail_assert+0xf1] #1 [sync_all+0x2ba] #2 [__real_main722+0x2f3] #3 [main+0x33] #4 [__libc_start_main+0xf1] #5 [_start+0x29] #6 [<unknown>+0x29] Subtest basic-all failed. **** DEBUG **** (gem_sync:32030) DEBUG: Test requirement passed: num_engines (gem_sync:32030) CRITICAL: Test assertion failure function sync_all, file gem_sync.c:570: (gem_sync:32030) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 (gem_sync:32030) CRITICAL: error: 2 != 0 **** END **** Subtest basic-all: FAIL (5.118s) Good, gem_sync is the test that is explicitly targetting this bug. A very similar unit is present in the CI, and it doesn't fail in the same way: fi-hsw-4770r https://intel-gfx-ci.01.org/CI/hardware.html They seem to be virtually identical judging by dmesg: same DMI device & BIOS (EFI) revision, CPU & GPU, CRTC & connector. Differences: Memory - failing device: [ 0.000000] Memory: 3835416K/4095164K available (8706K kernel code, 1434K rwdata, 3340K rodata, 1192K init, 22624K bss, 259748K reserved, 0K cma-reserved) Memory - CI: [ 0.000000] Memory: 3835416K/4095164K available (8706K kernel code, 1434K rwdata, 3340K rodata, 1192K init, 22624K bss, 259748K reserved, 0K cma-reserved) Display - failing: HDMI -> DVI converter Modeline 0:"1280x1024" 60 108000 1280 1328 1440 1688 1024 1025 1028 1066 0x40 0x5 Display - CI: HDMI (mDP) Modeline 0:"1920x1200" 60 154000 1920 1968 2000 2080 1200 1203 1209 1235 0x40 0x9 Correction Memory - failing device: Memory: 7818184K/8289396K available (8707K kernel code, 1429K rwdata, 3344K rodata, 1196K init, 22624K bss, 471212K reserved, 0K cma-reserved) Created attachment 130260 [details]
`lshw -C memory` on failing host
It shows that memory is populated in dual channel configuration, 2x4GiB
As requested, the memory configuration from fi-hsw-4770r $ sudo lshw -C memory *-firmware description: BIOS vendor: American Megatrends Inc. physical id: 0 version: F6 date: 12/10/2014 size: 64KiB capacity: 4032KiB capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int14serial int17printer acpi usb biosbootspecification uefi *-cache:0 description: L2 cache physical id: 3f slot: CPU Internal L2 size: 1MiB capacity: 1MiB capabilities: internal write-back unified configuration: level=2 *-cache:1 description: L1 cache physical id: 40 slot: CPU Internal L1 size: 256KiB capacity: 256KiB capabilities: internal write-back configuration: level=1 *-cache:2 description: L3 cache physical id: 41 slot: CPU Internal L3 size: 6MiB capacity: 6MiB capabilities: internal write-back unified configuration: level=3 *-memory description: System Memory physical id: 42 slot: System board or motherboard size: 4GiB *-bank:0 description: DIMM [empty] product: [Empty] vendor: [Empty] physical id: 0 serial: [Empty] slot: ChannelA-DIMM0 *-bank:1 description: DIMM [empty] product: [Empty] vendor: [Empty] physical id: 1 serial: [Empty] slot: ChannelA-DIMM1 *-bank:2 description: SODIMM DDR3 Synchronous 1600 MHz (0,6 ns) product: CT51264BF160BJ.C8F vendor: Conexant (Rockwell) physical id: 2 serial: 19080000 slot: ChannelB-DIMM0 size: 4GiB width: 64 bits clock: 1600MHz (0.6ns) *-bank:3 description: DIMM [empty] product: [Empty] vendor: [Empty] physical id: 3 serial: [Empty] slot: ChannelB-DIMM1 The big difference is that the failing host has two DIMMs in dual channel mode. Unfortunately, I'm unable to remove one to repeat the test currently. *** Bug 100055 has been marked as a duplicate of this bug. *** Dorota Czaplejewicz, please re-test with Chris' patch: https://patchwork.freedesktop.org/series/21876/ I tested gem_sync/basic-all and gem_sync/basic-each. They still fail the same way occasionally on IVB as well as HSW. gem_sync/basic-all - roughly 20% fail gem_sync/basic-each - roughly 50% fail Hello, Is there any new information on this case? Thank you. (In reply to Elizabeth from comment #20) > Hello, > Is there any new information on this case? Thank you. No track at https://intel-gfx-ci.01.org/cibuglog/ as known issue. No fails on HSW nor IVB according to: https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_sync@basic-all.html https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_sync@basic-each.html Also changing title due to: (In reply to Chris Wilson from comment #11) > Good, gem_sync is the test that is explicitly targetting this bug. No new sights reported for this issue, can we close this bug? Thank you. First of all. Sorry about spam. This is mass update for our bugs. Sorry if you feel this annoying but with this trying to understand if bug still valid or not. If bug investigation still in progress, please ignore this and I apologize! If you think this is not anymore valid, please comment to the bug that can be closed. If you haven't tested with our latest pre-upstream tree(drm-tip), can you do that also to see if issue is valid there still and if you cannot see issue there, please comment to the bug. Closing, please re-open if still occurs. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.
Created attachment 130051 [details] Dmesg (redacted for size) The test igt@gem_exec_parallel@basic fails on HSW i7-4770R about 45% of the time. CPU: i7-4770R Kernel: drm-tip: 2017y-02m-27d-07h-11m-41s OS: Fedora 25 Log: $ sudo ./gem_exec_parallel --r basic IGT-Version: 1.17-gd91fec7 (x86_64) (Linux: 4.10.0-CI-CI_DRM_2269_15ccc64+ x86_64) (gem_exec_parallel:13117) CRITICAL: Test assertion failure function all, file gem_exec_parallel.c:253: (gem_exec_parallel:13117) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 (gem_exec_parallel:13117) CRITICAL: error: 16 != 0 Stack trace: #0 [__igt_fail_assert+0xf1] #1 [all+0x491] #2 [<unknown>+0x491] #3 [<unknown>+0x491] Subtest basic failed. **** DEBUG **** (gem_exec_parallel:13117) DEBUG: Test requirement passed: nengine (gem_exec_parallel:13117) DEBUG: Verifying result (pass=0, handle=1) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=1, handle=2) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=2, handle=3) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=3, handle=4) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=4, handle=5) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=5, handle=6) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=6, handle=7) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=7, handle=8) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=8, handle=9) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=9, handle=10) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=10, handle=11) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=11, handle=12) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=12, handle=13) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=13, handle=14) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=14, handle=15) (gem_exec_parallel:13117) DEBUG: Verifying result (pass=15, handle=16) (gem_exec_parallel:13117) CRITICAL: Test assertion failure function all, file gem_exec_parallel.c:253: (gem_exec_parallel:13117) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 (gem_exec_parallel:13117) CRITICAL: error: 16 != 0 **** END **** Subtest basic: FAIL (8.760s)