Bug 95048 - tests/gem_close_race, threads SEGFAULT on BDW, GPU hang SKL, machine hang on SNB
Summary: tests/gem_close_race, threads SEGFAULT on BDW, GPU hang SKL, machine hang on SNB
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: All All
Importance: high critical
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-04-21 11:04 UTC by Marius Vlad
Modified: 2017-07-03 10:47 UTC
CC List: 3 users

See Also:
i915 platform: BDW, SKL, SNB
i915 features: GEM/Other, GPU hang


Attachments
gpu hang skl 4405Y (392.64 KB, text/plain)
2016-04-21 11:04 UTC, Marius Vlad
skl-4405y dmesg (54.53 KB, text/plain)
2016-04-21 12:23 UTC, Daniela Prodan
bdw-i7-5600u dmesg (36.93 KB, text/plain)
2016-04-21 12:24 UTC, Daniela Prodan
core file (16.57 MB, application/octet-stream)
2016-04-21 14:03 UTC, Marius Vlad
gem_close_race binary to be used w/ core (812.02 KB, application/octet-stream)
2016-04-21 14:04 UTC, Marius Vlad
dmesg with 0xff debug (39.77 KB, text/plain)
2016-04-21 16:26 UTC, Marius Vlad

Description Marius Vlad 2016-04-21 11:04:15 UTC
Created attachment 123112
gpu hang skl 4405Y

The basic-threads subtest of tests/gem_close_race hangs the machine on SNB and crashes with SIGSEGV on BDW and SKL. I also managed to get a GPU hang on another SKL machine.
Comment 1 Daniela Prodan 2016-04-21 12:23:45 UTC
Created attachment 123113
skl-4405y dmesg
Comment 2 Daniela Prodan 2016-04-21 12:24:22 UTC
Created attachment 123114
bdw-i7-5600u dmesg
Comment 3 Marius Vlad 2016-04-21 14:03:02 UTC
[New LWP 6945]
[New LWP 6943]
[New LWP 6944]
Core was generated by `/opt/igt/tests/gem_close_race --run-subtest basic-threads'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007ffc4b93e440 in ?? ()
[Current thread is 1 (LWP 6945)]
(gdb) file gem_close_race 
warning: exec file is newer than core file.
Reading symbols from gem_close_race...done.
(gdb) thread apply all bt full

Thread 3 (LWP 6944):
#0  0x00007f80db1bd0b7 in ?? ()
No symbol table info available.
#1  0x00007f80dbac5908 in ?? ()
No symbol table info available.
#2  0x00007f80d9afde40 in ?? ()
No symbol table info available.
#3  0x000000000000005d in ?? ()
No symbol table info available.
#4  0x0000000000000030 in ?? ()
No symbol table info available.
#5  0x0000000000000005 in ?? ()
No symbol table info available.
#6  0x00007f80d9afdd60 in ?? ()
No symbol table info available.
#7  0x00000000004044c0 in selfcopy (fd=1077961833, fd@entry=5, handle=<optimized out>, loops=92, loops@entry=100) at gem_close_race.c:117
        reloc = {{target_handle = 1, delta = 0, offset = 16, presumed_offset = 4294701056, read_domains = 2, write_domain = 2}, {target_handle = 1, delta = 0, offset = 32, presumed_offset = 4294701056, 
            read_domains = 2, write_domain = 0}}
        gem_exec = {{handle = 1, relocation_count = 0, relocs_ptr = 0, alignment = 0, offset = 4294701056, flags = 0, rsvd1 = 0, rsvd2 = 0}, {handle = 2, relocation_count = 2, relocs_ptr = 140191384722896, 
            alignment = 0, offset = 4294696960, flags = 0, rsvd1 = 0, rsvd2 = 0}}
        execbuf = {buffers_ptr = 140191384722960, buffer_count = 2, batch_start_offset = 0, batch_len = 48, DR1 = 0, DR4 = 0, num_cliprects = 0, cliprects_ptr = 0, flags = 3, rsvd1 = 0, rsvd2 = 0}
        gem_pwrite = {handle = 2, pad = 0, offset = 0, size = 48, data_ptr = 140191384722848}
        create = {size = 4096, handle = 2, pad = 0}
        buf = {1425014792, 63705088, 0, 66560, 0, 0, 33554432, 4096, 0, 0, 83886080, 0}
        b = <optimized out>
#8  0x0000000000404ab8 in thread_run (_data=0x233c030) at gem_close_race.c:174
        fd = 5
        n = 6
        handle = 1
        arg = {name = 1, handle = 1, size = 262144}
#9  0x00007f80daca16aa in ?? ()
No symbol table info available.
#10 0x0000000000000000 in ?? ()
No symbol table info available.

Thread 2 (LWP 6943):
#0  0x00007f80db0f53f1 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.

Thread 1 (LWP 6945):
#0  0x00007ffc4b93e440 in ?? ()
No symbol table info available.
#1  0x0000000000000000 in ?? ()
No symbol table info available.
Comment 4 Marius Vlad 2016-04-21 14:03:34 UTC
Created attachment 123118
core file
Comment 5 Marius Vlad 2016-04-21 14:04:03 UTC
Created attachment 123119
gem_close_race binary to be used w/ core
Comment 6 Chris Wilson 2016-04-21 14:32:17 UTC
Off the top of my head: try i915.enable_ppgtt=2 (i.e. disable 48-bit ppgtt), then i915.enable_ppgtt=1. Does either make the SIGSEGV and GPU hang go away on BDW/SKL?
Comment 7 Marius Vlad 2016-04-21 16:01:03 UTC
No, same thing.
Comment 8 Chris Wilson 2016-04-21 16:17:26 UTC
Last one in terms of options: i915.enable_execlists=0

That should be the difference between bdw/skl and the rest... I hope.
Comment 9 Marius Vlad 2016-04-21 16:26:56 UTC
Created attachment 123129
dmesg with 0xff debug
Comment 10 Chris Wilson 2016-04-22 10:27:28 UTC
Ok, the full-ppgtt (BDW/SKL) hang is definitely a test bug (handles being incorrectly used on a second fd). Not sure how that propagates into the corruption and sigsegv, but it's probably related. Equally not sure how this then becomes a hard hang on SNB.
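
To illustrate the test bug described above: GEM handles are only meaningful on the fd that created them, so the same numeric handle used on a second fd can name an unrelated (here, blank) object. The sketch below is a toy model in plain C, not i915 or IGT code; `toy_file`, `toy_create` and `toy_lookup` are hypothetical names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_HANDLES 8

/* Toy stand-in for one open DRM fd: each fd has its own handle table. */
struct toy_file {
    const char *objects[MAX_HANDLES + 1]; /* slot 0 unused, handles start at 1 */
    unsigned int next;                    /* last handle handed out */
};

/* Allocate the next free handle on this fd; returns 0 if the table is full. */
static unsigned int toy_create(struct toy_file *f, const char *contents)
{
    if (f->next >= MAX_HANDLES)
        return 0;
    f->objects[++f->next] = contents;
    return f->next;
}

/* Resolve a handle against this fd only; other fds' tables are never consulted. */
static const char *toy_lookup(const struct toy_file *f, unsigned int handle)
{
    if (handle == 0 || handle > f->next)
        return NULL;
    return f->objects[handle];
}
```

With two zero-initialised `toy_file` instances, the first `toy_create` on each returns handle 1, yet `toy_lookup` on the second returns its own object rather than the first one's batch. That is the kind of mix-up that would end in executing a blank buffer.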
Comment 11 Marius Vlad 2016-04-22 11:14:19 UTC
Side note: I used legacy contexts on BDW and it's exhibiting the same behaviour.
Comment 12 Chris Wilson 2016-04-22 15:06:09 UTC
commit 757b9be460e06c8466f6c49ab7f0d7ff234b5b54
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Apr 22 16:02:12 2016 +0100

    igt/gem_close_race: Avoid using threads, use signals instead
    
    Emulate the behaviour of the second thread killing fd at random by
    having a signal fire at a random time instead. Only one thread and so we
    do not have the issue of accessing another valid handle on another fd
    and so executing a blank buffer - triggering GPU hangs.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=95048
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Please test and confirm that we are not killing machines...
Comment 13 Chris Wilson 2016-04-22 15:06:24 UTC
(Or better yet that we are only killing snb!)
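
The approach in the commit above (a signal firing at a random time instead of a second thread closing the fd) can be sketched roughly as follows. This is a minimal standalone illustration, not the actual IGT change: it uses /dev/null and write() in place of a DRM fd and execbuf, and `race_close_once`, `close_victim` and `victim_fd` are hypothetical names.

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700 /* sigaction() and setitimer() under strict -std= modes */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

/* fd that the signal handler will close; -1 when there is nothing to close. */
static volatile sig_atomic_t victim_fd = -1;

static void close_victim(int sig)
{
    int saved_errno = errno; /* do not disturb the interrupted code's errno */

    (void)sig;
    if (victim_fd >= 0) {
        close(victim_fd);    /* close() is async-signal-safe */
        victim_fd = -1;
    }
    errno = saved_errno;
}

/*
 * Keep using an fd in a single thread until a one-shot timer signal closes it
 * at a random point within max_delay_us microseconds.
 * Returns 0 if the loop stopped with EBADF, -1 on any other outcome.
 */
static int race_close_once(unsigned int max_delay_us)
{
    struct sigaction sa, old_sa;
    struct itimerval timer;
    char byte = 0;
    int fd, err;

    if (max_delay_us == 0 || max_delay_us > 999999)
        max_delay_us = 999999; /* tv_usec must stay below one second */

    fd = open("/dev/null", O_WRONLY);
    if (fd < 0)
        return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = close_victim;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGALRM, &sa, &old_sa) < 0) {
        close(fd);
        return -1;
    }

    victim_fd = fd; /* publish the fd before arming the timer */

    memset(&timer, 0, sizeof(timer));
    timer.it_value.tv_usec = 1 + rand() % max_delay_us; /* one-shot, random delay */
    if (setitimer(ITIMER_REAL, &timer, NULL) < 0) {
        victim_fd = -1;
        close(fd);
        sigaction(SIGALRM, &old_sa, NULL);
        return -1;
    }

    /* The "work" loop: runs until the handler has pulled the fd from under us. */
    for (;;) {
        if (write(fd, &byte, 1) < 0) {
            if (errno == EINTR)
                continue;
            err = errno;
            break;
        }
    }

    sigaction(SIGALRM, &old_sa, NULL);
    return err == EBADF ? 0 : -1;
}
```

Because everything runs in one thread, there is no second fd whose handles could be picked up by mistake; the only race left is between the work loop and the close in the signal handler. A caller would typically seed `rand()` and invoke `race_close_once()` repeatedly with a small delay bound.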
Comment 14 Marius Vlad 2016-04-22 17:27:46 UTC
Yes, on SNB the test succeeds. Will test on BDW and SKL to make sure.
Comment 15 Marius Vlad 2016-04-25 09:52:07 UTC
No longer crashes on BDW and SKL with:

commit 757b9be460e06c8466f6c49ab7f0d7ff234b5b54
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Apr 22 16:02:12 2016 +0100

    igt/gem_close_race: Avoid using threads, use signals instead
Comment 16 Chris Wilson 2016-04-25 09:58:50 UTC
commit 14f7959038c6a79a3a409c420f33d00902497daa
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Apr 25 10:56:37 2016 +0100

    igt/gem_close_race: Restore threads test to BAT status
    
    Let's try it again because it would have caught a bug in a patch I sent
    to the ml...
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=95048
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Comment 17 Jari Tahvanainen 2017-07-03 10:47:03 UTC
Closing >1 year old resolved+fixed.