Created attachment 123112 [details] gpu hang skl 4405Y

tests/gem_close_race, subtest basic-threads, hangs the machine on SNB and crashes with SIGSEGV on BDW and SKL. Managed to get a GPU hang from another SKL.
Created attachment 123113 [details] skl-4405y dmesg
Created attachment 123114 [details] bdw-i7-5600u dmesg
[New LWP 6945]
[New LWP 6943]
[New LWP 6944]
Core was generated by `/opt/igt/tests/gem_close_race --run-subtest basic-threads'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007ffc4b93e440 in ?? ()
[Current thread is 1 (LWP 6945)]
(gdb) file gem_close_race
warning: exec file is newer than core file.
Reading symbols from gem_close_race...done.
(gdb) thread apply all bt full

Thread 3 (LWP 6944):
#0 0x00007f80db1bd0b7 in ?? ()
No symbol table info available.
#1 0x00007f80dbac5908 in ?? ()
No symbol table info available.
#2 0x00007f80d9afde40 in ?? ()
No symbol table info available.
#3 0x000000000000005d in ?? ()
No symbol table info available.
#4 0x0000000000000030 in ?? ()
No symbol table info available.
#5 0x0000000000000005 in ?? ()
No symbol table info available.
#6 0x00007f80d9afdd60 in ?? ()
No symbol table info available.
#7 0x00000000004044c0 in selfcopy (fd=1077961833, fd@entry=5, handle=<optimized out>, loops=92, loops@entry=100) at gem_close_race.c:117
        reloc = {{target_handle = 1, delta = 0, offset = 16, presumed_offset = 4294701056, read_domains = 2, write_domain = 2}, {target_handle = 1, delta = 0, offset = 32, presumed_offset = 4294701056, read_domains = 2, write_domain = 0}}
        gem_exec = {{handle = 1, relocation_count = 0, relocs_ptr = 0, alignment = 0, offset = 4294701056, flags = 0, rsvd1 = 0, rsvd2 = 0}, {handle = 2, relocation_count = 2, relocs_ptr = 140191384722896, alignment = 0, offset = 4294696960, flags = 0, rsvd1 = 0, rsvd2 = 0}}
        execbuf = {buffers_ptr = 140191384722960, buffer_count = 2, batch_start_offset = 0, batch_len = 48, DR1 = 0, DR4 = 0, num_cliprects = 0, cliprects_ptr = 0, flags = 3, rsvd1 = 0, rsvd2 = 0}
        gem_pwrite = {handle = 2, pad = 0, offset = 0, size = 48, data_ptr = 140191384722848}
        create = {size = 4096, handle = 2, pad = 0}
        buf = {1425014792, 63705088, 0, 66560, 0, 0, 33554432, 4096, 0, 0, 83886080, 0}
        b = <optimized out>
#8 0x0000000000404ab8 in thread_run (_data=0x233c030) at gem_close_race.c:174
        fd = 5
        n = 6
        handle = 1
        arg = {name = 1, handle = 1, size = 262144}
#9 0x00007f80daca16aa in ?? ()
No symbol table info available.
#10 0x0000000000000000 in ?? ()
No symbol table info available.

Thread 2 (LWP 6943):
#0 0x00007f80db0f53f1 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.

Thread 1 (LWP 6945):
#0 0x00007ffc4b93e440 in ?? ()
No symbol table info available.
#1 0x0000000000000000 in ?? ()
No symbol table info available.
Created attachment 123118 [details] core file
Created attachment 123119 [details] gem_close_race binary to be used w/ core
Off the top of my head: i915.enable_ppgtt=2 (i.e. disable 48bit ppgtt), then i915.enable_ppgtt=1? Do they make the SIGSEGV and GPU hang go away on bdw/skl?
No, same thing.
Last one in terms of options: i915.enable_execlists=0. That should be the difference between bdw/skl and the rest... I hope.
Created attachment 123129 [details] dmesg with 0xff debug
Ok, the full-ppgtt (BDW/SKL) hang is definitely a test bug (handles being incorrectly used on a second fd). Not sure how that propagates into the corruption and sigsegv, but it's probably related. Equally not sure how this then becomes a hard hang on SNB.
Side note: I used legacy contexts on BDW and it's exhibiting the same behaviour.
commit 757b9be460e06c8466f6c49ab7f0d7ff234b5b54
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Apr 22 16:02:12 2016 +0100

    igt/gem_close_race: Avoid using threads, use signals instead

    Emulate the behaviour of the second thread killing fd at random by
    having a signal fire at a random time instead. Only one thread and so we
    do not have the issue of accessing another valid handle on another fd and
    so executing a blank buffer - triggering GPU hangs.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=95048
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>

Please test and confirm that we are not killing machines...
(Or better yet that we are only killing snb!)
Yes, on SNB the test succeeds. Will test on BDW and SKL to make sure.
No longer crashes on BDW and SKL with:

commit 757b9be460e06c8466f6c49ab7f0d7ff234b5b54
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Apr 22 16:02:12 2016 +0100

    igt/gem_close_race: Avoid using threads, use signals instead
commit 14f7959038c6a79a3a409c420f33d00902497daa
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Apr 25 10:56:37 2016 +0100

    igt/gem_close_race: Restore threads test to BAT status

    Let's try it again because it would have caught a bug in a patch I sent
    to the ml...

    References: https://bugs.freedesktop.org/show_bug.cgi?id=95048
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Closing >1 year old resolved+fixed.