Bug 111380 - [CI][DRMTIP] igt@gem_busy@close-race - fail - Failed assertion: !"GPU hung"
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Whiteboard: ReadyForDev
Reported: 2019-08-12 12:31 UTC by Lakshmi
Modified: 2019-08-12 14:37 UTC

i915 platform: G45
i915 features: GPU hang

Description Lakshmi 2019-08-12 12:31:47 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_340/fi-elk-e7500/igt@gem_busy@close-race.html
Starting subtest: close-race
(gem_busy:1018) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:502:
(gem_busy:1018) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest close-race failed.
**** DEBUG ****
(gem_busy:1018) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_busy:1018) igt_dummyload-DEBUG: Test requirement passed: vgem_has_fences(cork->vgem.device)
(gem_busy:1018) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_busy:1018) DEBUG: Test requirement passed: ncpus > 1
(gem_busy:1018) igt_core-DEBUG: Test requirement passed: !igt_run_in_simulation()
(gem_busy:1018) intel_os-DEBUG: Checking 153 surfaces of size 4096 bytes (total 708608) against RAM
(gem_busy:1018) drmtest-DEBUG: Test requirement passed: !(fd<0)
(gem_busy:1018) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_busy:1018) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_busy:1018) intel_os-DEBUG: Test requirement passed: sufficient_memory
(gem_busy:1018) DEBUG: Test requirement passed: nengine
(gem_busy:1018) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:502:
(gem_busy:1018) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
(gem_busy:1018) igt_core-INFO: Stack trace:
(gem_busy:1018) igt_core-INFO:   #0 ../lib/igt_core.c:1674 __igt_fail_assert()
(gem_busy:1018) igt_core-INFO:   #1 [sig_abort+0x3a]
(gem_busy:1018) igt_core-INFO:   #2 [killpg+0x40]
(gem_busy:1018) igt_core-INFO:   #3 ../sysdeps/unix/sysv/linux/wait.c:29 wait()
(gem_busy:1018) igt_core-INFO:   #4 ../lib/igt_core.c:1957 __igt_waitchildren()
(gem_busy:1018) igt_core-INFO:   #5 ../lib/igt_core.c:2008 igt_waitchildren()
(gem_busy:1018) igt_core-INFO:   #6 ../tests/i915/gem_busy.c:393 __real_main471()
(gem_busy:1018) igt_core-INFO:   #7 ../tests/i915/gem_busy.c:471 main()
(gem_busy:1018) igt_core-INFO:   #8 ../csu/libc-start.c:344 __libc_start_main()
(gem_busy:1018) igt_core-INFO:   #9 [_start+0x2a]
****  END  ****
Subtest close-race: FAIL (17.207s)
gem_busy: ../lib/igt_core.c:1730: igt_exit: Assertion `!num_test_children' failed.
Received signal SIGABRT.
Stack trace: 
 #0 [fatal_sig_handler+0xd6]
 #1 [killpg+0x40]
 #2 [gsignal+0xc7]
 #3 [abort+0x141]
 #4 [uselocale+0x33a]
 #5 [__assert_fail+0x42]
 #6 [igt_exit+0x133]
 #7 [main+0x2c]
 #8 [__libc_start_main+0xe7]
 #9 [_start+0x2a]
Comment 1 CI Bug Log 2019-08-12 12:33:34 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ELK:  igt@gem_busy@close-race - fail - Failed assertion: !"GPU hung"
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_341/fi-elk-e7500/igt@gem_busy@close-race.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_340/fi-elk-e7500/igt@gem_busy@close-race.html
Comment 2 Chris Wilson 2019-08-12 13:48:20 UTC
Hmm. Hangcheck looks healthy enough. Our survival is dependent on

                igt_until_timeout(20) {
                        for (i = 0; i < nhandles; i++) {
                                igt_spin_free(fd, spin[i]);
                                spin[i] = __igt_spin_new(fd,
                                                         .engine = engines[rand() % nengine]);
                                handles[i] = spin[i]->handle;
                                __sync_synchronize();
                        }
                        count += nhandles;
                }

not blocking and killing all handles every 6s (or hangcheck frequency).

And nhandles is gem_measure_ring_inflight(). Ok, I think this is another off-by-one due to the ring wrapping.
Comment 3 Chris Wilson 2019-08-12 14:37:18 UTC
I think

commit a49a3a6cdbc4949c0ae8df5f3d8c3e476aefdea1 (HEAD, upstream/master)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Aug 12 11:30:28 2019 +0100

    lib/i915: Trim ring measurement by one
    
    Be a little more conservative in our ring measurement and exclude one
    batch to leave room in case our user needs to wrap (where a request will
    be expanded to cover the unused space at the end of the ring).
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=111374
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

should solve this.
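
To illustrate the off-by-one described above, here is a minimal toy model in C. It is only a sketch under stated assumptions: the function names, the byte-granular ring, and the fixed request size are all hypothetical and are not taken from IGT or i915. It assumes that a request which does not fit before the end of the ring is expanded to cover the unused tail space, as the commit message describes, and that nothing retires while the ring is being filled.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Toy model, not IGT or i915 code. All names here are hypothetical.
 *
 * Returns true if `count` requests of `req` bytes can be queued into an
 * empty ring of `ring` bytes whose tail starts at offset `start`, with
 * nothing retiring in the meantime.
 */
bool fits_without_blocking(unsigned ring, unsigned req,
			   unsigned start, unsigned count)
{
	unsigned tail = start;
	unsigned used = 0;

	assert(req > 0 && req <= ring && start < ring);

	for (unsigned i = 0; i < count; i++) {
		unsigned need = req;

		/*
		 * Assumed wrap rule: a request that would cross the end of
		 * the ring also pays for the unused space at the end.
		 */
		if (tail + req > ring) {
			need += ring - tail;
			tail = 0;
		}

		if (used + need > ring)
			return false; /* the submitter would block here */

		used += need;
		tail += req;
	}

	return true;
}

/* Count how many requests fit when filling starts at offset 0. */
unsigned measure_from_zero(unsigned ring, unsigned req)
{
	unsigned n = 0;

	while (fits_without_blocking(ring, req, 0, n + 1))
		n++;

	return n;
}
```

In this model, a 4096-byte ring with 192-byte requests measures 21 requests from offset 0. Submitting 21 requests from offset 100 would block, because the wrapping request also consumes 156 bytes of padding. Trimming the measurement to 20 leaves enough room for any start offset, since the padding is always smaller than one request.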