Bug 100052

Summary: [BAT] [HSW] missed breadcrumbs [igt@gem_sync@basic-[all/each] and others fail]
Product: DRI Reporter: Dorota Czaplejewicz <dorota.czaplejewicz>
Component: DRM/IntelAssignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED WORKSFORME QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major    
Priority: high CC: daniel, dorota.czaplejewicz, intel-gfx-bugs, krisman
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: HSW, IVB i915 features: GEM/execlists
Attachments:
Description Flags
Dmesg (redacted for size)
none
`lshw -C memory` on failing host none

Description Dorota Czaplejewicz 2017-03-03 15:46:06 UTC
Created attachment 130051 [details]
Dmesg (redacted for size)

The test igt@gem_exec_parallel@basic fails on HSW i7-4770R about 45% of the time.

CPU: i7-4770R
Kernel: drm-tip: 2017y-02m-27d-07h-11m-41s
OS: Fedora 25

Log:
$ sudo ./gem_exec_parallel --r basic
IGT-Version: 1.17-gd91fec7 (x86_64) (Linux: 4.10.0-CI-CI_DRM_2269_15ccc64+ x86_64)
(gem_exec_parallel:13117) CRITICAL: Test assertion failure function all, file gem_exec_parallel.c:253:
(gem_exec_parallel:13117) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_exec_parallel:13117) CRITICAL: error: 16 != 0
Stack trace:
  #0 [__igt_fail_assert+0xf1]
  #1 [all+0x491]
  #2 [<unknown>+0x491]
  #3 [<unknown>+0x491]
Subtest basic failed.
**** DEBUG ****
(gem_exec_parallel:13117) DEBUG: Test requirement passed: nengine
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=0, handle=1)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=1, handle=2)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=2, handle=3)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=3, handle=4)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=4, handle=5)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=5, handle=6)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=6, handle=7)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=7, handle=8)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=8, handle=9)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=9, handle=10)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=10, handle=11)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=11, handle=12)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=12, handle=13)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=13, handle=14)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=14, handle=15)
(gem_exec_parallel:13117) DEBUG: Verifying result (pass=15, handle=16)
(gem_exec_parallel:13117) CRITICAL: Test assertion failure function all, file gem_exec_parallel.c:253:
(gem_exec_parallel:13117) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_exec_parallel:13117) CRITICAL: error: 16 != 0
****  END  ****
Subtest basic: FAIL (8.760s)
Comment 1 Chris Wilson 2017-03-03 15:59:13 UTC
commit 24754d751cb86f6069315d8d613e23afcab06c91
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 3 14:45:57 2017 +0000

    drm/i915: Take reference for signaling the request from hardirq
    
    Being inside a spinlock signaling that the hardware just completed a
    request doesn't prevent a second thread already spotting that the
    request is complete, freeing it and reallocating it! The code currently
    tries to prevent this using RCU -- but that only prevents the request
    from being freed, it doesn't prevent us from reallocating it - that
    requires us to take a reference.
    
    [  206.922985] BUG: spinlock already unlocked on CPU#4, gem_exec_parall/7796
    [  206.922994]  lock: 0xffff8801c6047120, .magic: dead4ead, .owner: <none>/-1, .owner_cpu: -1
    [  206.923000] CPU: 4 PID: 7796 Comm: gem_exec_parall Not tainted 4.10.0-CI-Patchwork_4008+ #1
    [  206.923006] Hardware name: System manufacturer System Product Name/Z170M-PLUS, BIOS 1805 06/20/2016
    [  206.923012] Call Trace:
    [  206.923014]  <IRQ>
    [  206.923019]  dump_stack+0x67/0x92
    [  206.923023]  spin_dump+0x73/0xc0
    [  206.923027]  do_raw_spin_unlock+0x79/0xb0
    [  206.923031]  _raw_spin_unlock_irqrestore+0x27/0x60
    [  206.923042]  dma_fence_signal+0x160/0x230
    [  206.923060]  notify_ring+0xae/0x2e0 [i915]
    [  206.923073]  ? ibx_hpd_irq_handler+0xc0/0xc0 [i915]
    [  206.923086]  gen8_gt_irq_handler+0x219/0x290 [i915]
    [  206.923100]  gen8_irq_handler+0x8e/0x6b0 [i915]
    [  206.923105]  __handle_irq_event_percpu+0x58/0x370
    [  206.923109]  handle_irq_event_percpu+0x1e/0x50
    [  206.923113]  handle_irq_event+0x34/0x60
    [  206.923117]  handle_edge_irq+0xbe/0x150
    [  206.923122]  handle_irq+0x15/0x20
    [  206.923126]  do_IRQ+0x63/0x130
    [  206.923142]  ? i915_mutex_lock_interruptible+0x39/0x140 [i915]
    [  206.923148]  common_interrupt+0x90/0x90
    [  206.923153] RIP: 0010:osq_lock+0x77/0x110
    [  206.923157] RSP: 0018:ffffc90001cabaa0 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff6e
    [  206.923164] RAX: 0000000000000000 RBX: ffff880236d1abc0 RCX: ffff8801ef642fc0
    [  206.923169] RDX: ffff8801ef6427c0 RSI: ffffffff81c6e7fd RDI: ffffffff81c7c848
    [  206.923175] RBP: ffffc90001cabab8 R08: 00000000692bb19b R09: 08c1493200000000
    [  206.923180] R10: 0000000000000001 R11: 0000000000000001 R12: ffff880236cdabc0
    [  206.923185] R13: ffff8802207f00b0 R14: ffffffffa00b7cd9 R15: ffff8802207f0070
    [  206.923191]  </IRQ>
    [  206.923206]  ? i915_mutex_lock_interruptible+0x39/0x140 [i915]
    [  206.923213]  __mutex_lock+0x649/0x990
    [  206.923217]  ? __mutex_lock+0xb0/0x990
    [  206.923221]  ? _raw_spin_unlock+0x2c/0x50
    [  206.923226]  ? __pm_runtime_resume+0x56/0x80
    [  206.923242]  ? i915_mutex_lock_interruptible+0x39/0x140 [i915]
    [  206.923249]  mutex_lock_interruptible_nested+0x16/0x20
    [  206.923264]  i915_mutex_lock_interruptible+0x39/0x140 [i915]
    [  206.923270]  ? __pm_runtime_resume+0x56/0x80
    [  206.923285]  i915_gem_do_execbuffer.isra.15+0x442/0x1d10 [i915]
    [  206.923291]  ? __lock_acquire+0x449/0x1b50
    [  206.923296]  ? __might_fault+0x3e/0x90
    [  206.923301]  ? __might_fault+0x87/0x90
    [  206.923305]  ? __might_fault+0x3e/0x90
    [  206.923320]  i915_gem_execbuffer2+0xb5/0x220 [i915]
    [  206.923327]  drm_ioctl+0x200/0x450
    [  206.923341]  ? i915_gem_execbuffer+0x330/0x330 [i915]
    [  206.923348]  do_vfs_ioctl+0x90/0x6e0
    [  206.923352]  ? __fget+0x108/0x200
    [  206.923356]  ? expand_files+0x2b0/0x2b0
    [  206.923361]  SyS_ioctl+0x3c/0x70
    [  206.923365]  entry_SYSCALL_64_fastpath+0x1c/0xb1
    [  206.923369] RIP: 0033:0x7fdd75fc6357
    [  206.923373] RSP: 002b:00007fdd20e59bf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
    [  206.923380] RAX: ffffffffffffffda RBX: ffffffff81481ff3 RCX: 00007fdd75fc6357
    [  206.923385] RDX: 00007fdd20e59c70 RSI: 0000000040406469 RDI: 0000000000000003
    [  206.923390] RBP: ffffc90001cabf88 R08: 0000000000000040 R09: 00000000000003f7
    [  206.923396] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    [  206.923401] R13: 0000000000000003 R14: 0000000040406469 R15: 0000000001cf9cb0
    [  206.923408]  ? __this_cpu_preempt_check+0x13/0x20
    
    Fixes: 56299fb7d904 ("drm/i915: Signal first fence from irq handler if complete")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100051
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/2017030
Comment 2 Chris Wilson 2017-03-03 16:03:17 UTC
Opened the wrong gem_exec_parallel bug. This could just be the other breadcrumbs bug, so let's wait and see.
Comment 3 Dorota Czaplejewicz 2017-03-06 11:09:51 UTC
Bisected:

commit a402eae64d0ad12b1c4a411f250d6c161e67f623
Merge: 7800fb6 0c744ea
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Jan 4 11:34:01 2017 +0100

    Merge tag 'v4.10-rc2' into drm-intel-next-queued
    
    Backmerge Linux 4.10-rc2 to resync with our -fixes cherry-picks. I've
    done the backmerge directly because Dave is on vacation.
    
    Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>
Comment 4 Dorota Czaplejewicz 2017-03-06 16:47:21 UTC
Note about the bisection: I tested both parent commits, and they were both good.
Comment 5 Jani Nikula 2017-03-07 18:57:43 UTC
(In reply to Dorota Czaplejewicz from comment #3)
> Bisected:
> 
> commit a402eae64d0ad12b1c4a411f250d6c161e67f623
> Merge: 7800fb6 0c744ea
> Author: Daniel Vetter <daniel.vetter@ffwll.ch>
> Date:   Wed Jan 4 11:34:01 2017 +0100
> 
>     Merge tag 'v4.10-rc2' into drm-intel-next-queued
>     
>     Backmerge Linux 4.10-rc2 to resync with our -fixes cherry-picks. I've
>     done the backmerge directly because Dave is on vacation.
>     
>     Signed-off-by: Daniel Vetter <daniel.vetter@intel.com>

Ugh. Cc: Daniel.
Comment 6 Daniel Vetter 2017-03-07 19:44:42 UTC
Nothing in the merge commit, so likely an interaction between too patches. The way to figure this out is probably with bisecting which additional patch breaks stuff:

$ git bisect start
$ git bisect good $first_parent
$ git bisect bad $merge_commit

Then create a separate git repo, and for every sha1 that bisect wants you to test do:

$ git reset --hard $first_parent
$ git merge $sha1_to_test

Then run your testcase, go back to the other git checkout where you do the bisecting and report the result. That should give you you an offending commit somewhere in the branch for the 2nd parent.

If you then do it the other way round you will find the offending patch in the 1st parent. Those two should then be the 2 patches which interact in a bad way somehow, and probably explain the bug.

A more effective way might be to do a full git imerge on both branches, and run the bisect over that. But that's even more fancy ...
Comment 7 Chris Wilson 2017-03-07 20:40:43 UTC
The bisect is bogus. It is just the expected seqno incoherency that we all know and love.
Comment 8 Dorota Czaplejewicz 2017-03-10 15:52:49 UTC
Kernel https://intel-gfx-ci.01.org/CI/CI_DRM_2320/kernel.tar.gz on the same hardware gave 31/32 failures.
Comment 9 Dorota Czaplejewicz 2017-03-10 16:41:48 UTC
also seen in igt@gem_sync@basic-each

$ sudo ./gem_sync --r basic-each
IGT-Version: 1.17-ge8eb9af (x86_64) (Linux: 4.11.0-rc1-CI-CI_DRM_2320+ x86_64)
Using Legacy submission , with semaphores
bsd completed 53248 cycles: 94.688 us
vebox completed 53248 cycles: 94.695 us
render completed 53248 cycles: 94.938 us
blt completed 32768 cycles: 168.624 us
(gem_sync:23461) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:153:
(gem_sync:23461) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:23461) CRITICAL: error: 2 != 0
Stack trace:
  #0 [__igt_fail_assert+0xf1]
  #1 [sync_ring+0x2cc]
  #2 [__real_main722+0x1d7]
  #3 [main+0x33]
  #4 [__libc_start_main+0xf1]
  #5 [_start+0x29]
  #6 [<unknown>+0x29]
Subtest basic-each failed.
**** DEBUG ****
(gem_sync:23461) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:153:
(gem_sync:23461) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:23461) CRITICAL: error: 2 != 0
****  END  ****
Subtest basic-each: FAIL (5.575s)
Comment 10 Dorota Czaplejewicz 2017-03-10 16:49:16 UTC
Also igt@gem_sync@basic-all:

$ sudo ./gem_sync --r basic-all
IGT-Version: 1.17-ge8eb9af (x86_64) (Linux: 4.11.0-rc1-CI-CI_DRM_2320+ x86_64)
Using Legacy submission , with semaphores
Completed 21504 cycles: 235.799 us
(gem_sync:32030) CRITICAL: Test assertion failure function sync_all, file gem_sync.c:570:
(gem_sync:32030) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:32030) CRITICAL: error: 2 != 0
Stack trace:
  #0 [__igt_fail_assert+0xf1]
  #1 [sync_all+0x2ba]
  #2 [__real_main722+0x2f3]
  #3 [main+0x33]
  #4 [__libc_start_main+0xf1]
  #5 [_start+0x29]
  #6 [<unknown>+0x29]
Subtest basic-all failed.
**** DEBUG ****
(gem_sync:32030) DEBUG: Test requirement passed: num_engines
(gem_sync:32030) CRITICAL: Test assertion failure function sync_all, file gem_sync.c:570:
(gem_sync:32030) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:32030) CRITICAL: error: 2 != 0
****  END  ****
Subtest basic-all: FAIL (5.118s)
Comment 11 Chris Wilson 2017-03-10 16:50:14 UTC
Good, gem_sync is the test that is explicitly targetting this bug.
Comment 12 Dorota Czaplejewicz 2017-03-16 15:26:57 UTC
A very similar unit is present in the CI, and it doesn't fail in the same way: fi-hsw-4770r https://intel-gfx-ci.01.org/CI/hardware.html

They seem to be virtually identical judging by dmesg: same DMI device & BIOS (EFI) revision, CPU & GPU, CRTC & connector.

Differences:

Memory - failing device:
[    0.000000] Memory: 3835416K/4095164K available (8706K kernel code, 1434K rwdata, 3340K rodata, 1192K init, 22624K bss, 259748K reserved, 0K cma-reserved)

Memory - CI:
[    0.000000] Memory: 3835416K/4095164K available (8706K kernel code, 1434K rwdata, 3340K rodata, 1192K init, 22624K bss, 259748K reserved, 0K cma-reserved)

Display - failing:
HDMI -> DVI converter
Modeline 0:"1280x1024" 60 108000 1280 1328 1440 1688 1024 1025 1028 1066 0x40 0x5

Display - CI:
HDMI (mDP)
Modeline 0:"1920x1200" 60 154000 1920 1968 2000 2080 1200 1203 1209 1235 0x40 0x9
Comment 13 Dorota Czaplejewicz 2017-03-16 15:30:54 UTC
Correction

Memory - failing device:
Memory: 7818184K/8289396K available (8707K kernel code, 1429K rwdata, 3344K rodata, 1196K init, 22624K bss, 471212K reserved, 0K cma-reserved)
Comment 14 Dorota Czaplejewicz 2017-03-16 15:50:12 UTC
Created attachment 130260 [details]
`lshw -C memory` on failing host

It shows that memory is populated in dual channel configuration, 2x4GiB
Comment 15 Tomi Sarvela 2017-03-16 15:53:11 UTC
As requested, the memory configuration from fi-hsw-4770r

$ sudo lshw -C memory
  *-firmware              
       description: BIOS
       vendor: American Megatrends Inc.
       physical id: 0
       version: F6
       date: 12/10/2014
       size: 64KiB
       capacity: 4032KiB
       capabilities: pci upgrade shadowing cdboot bootselect socketedrom edd int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int14serial int17printer acpi usb biosbootspecification uefi
  *-cache:0
       description: L2 cache
       physical id: 3f
       slot: CPU Internal L2
       size: 1MiB
       capacity: 1MiB
       capabilities: internal write-back unified
       configuration: level=2
  *-cache:1
       description: L1 cache
       physical id: 40
       slot: CPU Internal L1
       size: 256KiB
       capacity: 256KiB
       capabilities: internal write-back
       configuration: level=1
  *-cache:2
       description: L3 cache
       physical id: 41
       slot: CPU Internal L3
       size: 6MiB
       capacity: 6MiB
       capabilities: internal write-back unified
       configuration: level=3
  *-memory
       description: System Memory
       physical id: 42
       slot: System board or motherboard
       size: 4GiB
     *-bank:0
          description: DIMM [empty]
          product: [Empty]
          vendor: [Empty]
          physical id: 0
          serial: [Empty]
          slot: ChannelA-DIMM0
     *-bank:1
          description: DIMM [empty]
          product: [Empty]
          vendor: [Empty]
          physical id: 1
          serial: [Empty]
          slot: ChannelA-DIMM1
     *-bank:2
          description: SODIMM DDR3 Synchronous 1600 MHz (0,6 ns)
          product: CT51264BF160BJ.C8F
          vendor: Conexant (Rockwell)
          physical id: 2
          serial: 19080000
          slot: ChannelB-DIMM0
          size: 4GiB
          width: 64 bits
          clock: 1600MHz (0.6ns)
     *-bank:3
          description: DIMM [empty]
          product: [Empty]
          vendor: [Empty]
          physical id: 3
          serial: [Empty]
          slot: ChannelB-DIMM1
Comment 16 Dorota Czaplejewicz 2017-03-16 15:57:15 UTC
The big difference is that the failing host has two DIMMs in dual channel mode.

Unfortunately, I'm unable to remove one to repeat the test currently.
Comment 17 Chris Wilson 2017-03-18 17:36:55 UTC
*** Bug 100055 has been marked as a duplicate of this bug. ***
Comment 18 yann 2017-03-27 16:38:19 UTC
Dorota Czaplejewicz, please re-test with Chris' patch: https://patchwork.freedesktop.org/series/21876/
Comment 19 Dorota Czaplejewicz 2017-03-27 18:32:27 UTC
I tested gem_sync/basic-all and gem_sync/basic-each. They still fail the same way occasionally on IVB as well as HSW.

gem_sync/basic-all - roughly 20% fail
gem_sync/basic-each - roughly 50% fail
Comment 20 Elizabeth 2017-06-27 19:53:30 UTC
Hello,
Is there any new information on this case? Thank you.
Comment 21 Elizabeth 2017-10-06 17:26:23 UTC
(In reply to Elizabeth from comment #20)
> Hello,
> Is there any new information on this case? Thank you.
No track at https://intel-gfx-ci.01.org/cibuglog/ as known issue.

No fails on HSW nor IVB according to:
https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_sync@basic-all.html
https://intel-gfx-ci.01.org/tree/drm-tip/igt@gem_sync@basic-each.html

Also changing title due to:
(In reply to Chris Wilson from comment #11)
> Good, gem_sync is the test that is explicitly targetting this bug.
Comment 22 Elizabeth 2018-03-05 20:02:08 UTC
No new sights reported for this issue, can we close this bug? Thank you.
Comment 23 Jani Saarinen 2018-03-29 07:10:45 UTC
First of all. Sorry about spam.
This is mass update for our bugs. 

Sorry if you feel this annoying but with this trying to understand if bug still valid or not.
If bug investigation still in progress, please ignore this and I apologize!

If you think this is not anymore valid, please comment to the bug that can be closed.
If you haven't tested with our latest pre-upstream tree(drm-tip), can you do that also to see if issue is valid there still and if you cannot see issue there, please comment to the bug.
Comment 24 Jani Saarinen 2018-04-20 14:06:06 UTC
Closing, please re-open if still occurs.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.