Bug 93091

Summary: [opencl] segfault when running any opencl programs (like clinfo)
Product: Mesa
Reporter: Paulo Dias <paulo.miguel.dias>
Component: Other
Assignee: mesa-dev
Status: RESOLVED FIXED
QA Contact: mesa-dev
Severity: normal    
Priority: medium    
Version: git   
Hardware: Other   
OS: All   
Attachments: output of "valgrind clinfo"
gdb output and backtrace for clinfo
attachment-27601-0.html

Description Paulo Dias 2015-11-24 07:25:51 UTC
The title says it all: clinfo and other OpenCL programs segfault.

Last time I tested, about 2 weeks ago, it was working fine.

Using llvm 3.8 git, mesa git, radeonsi git, all from today; xorg 1.17, kubuntu 15.10 + padoka ppa
---
groo@hydra:~$ gdb clinfo
GNU gdb (Ubuntu 7.10-1ubuntu2) 7.10
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from clinfo...(no debugging symbols found)...done.
(gdb) run
Starting program: /usr/bin/clinfo
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffec786700 (LWP 3683)]

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff09c26c1 in ?? () from /usr/lib/x86_64-linux-gnu/libMesaOpenCL.so.1
(gdb) bt
#0  0x00007ffff09c26c1 in ?? () from /usr/lib/x86_64-linux-gnu/libMesaOpenCL.so.1
#1  0x00007ffff09eb610 in ?? () from /usr/lib/x86_64-linux-gnu/libMesaOpenCL.so.1
#2  0x00007ffff09f7864 in ?? () from /usr/lib/x86_64-linux-gnu/libMesaOpenCL.so.1
#3  0x00007ffff09c2276 in ?? () from /usr/lib/x86_64-linux-gnu/libMesaOpenCL.so.1
#4  0x00007ffff7de95ba in call_init (l=<optimized out>, argc=argc@entry=1, argv=argv@entry=0x7fffffffdec8, env=env@entry=0x7fffffffded8) at dl-init.c:72
#5  0x00007ffff7de96cb in call_init (env=<optimized out>, argv=<optimized out>, argc=<optimized out>, l=<optimized out>) at dl-init.c:30
#6  _dl_init (main_map=main_map@entry=0x67ffd0, argc=1, argv=0x7fffffffdec8, env=0x7fffffffded8) at dl-init.c:120
#7  0x00007ffff7dee587 in dl_open_worker (a=a@entry=0x7fffffffd9b8) at dl-open.c:579
#8  0x00007ffff7de9464 in _dl_catch_error (objname=objname@entry=0x7fffffffd9a8, errstring=errstring@entry=0x7fffffffd9b0, mallocedp=mallocedp@entry=0x7fffffffd9a7, operate=operate@entry=0x7ffff7dee0a0 <dl_open_worker>,
    args=args@entry=0x7fffffffd9b8) at dl-error.c:187
#9  0x00007ffff7ded9a3 in _dl_open (file=0x68bd10 "libMesaOpenCL.so.1", mode=-2147483647, caller_dlopen=0x7ffff7bd1dea, nsid=-2, argc=<optimized out>, argv=<optimized out>, env=0x7fffffffded8) at dl-open.c:663
#10 0x00007ffff7600fc9 in dlopen_doit (a=a@entry=0x7fffffffdbd0) at dlopen.c:66
#11 0x00007ffff7de9464 in _dl_catch_error (objname=0x7ffff78030f0 <last_result+16>, errstring=0x7ffff78030f8 <last_result+24>, mallocedp=0x7ffff78030e8 <last_result+8>, operate=0x7ffff7600f70 <dlopen_doit>, args=0x7fffffffdbd0)
    at dl-error.c:187
#12 0x00007ffff760162d in _dlerror_run (operate=operate@entry=0x7ffff7600f70 <dlopen_doit>, args=args@entry=0x7fffffffdbd0) at dlerror.c:163
#13 0x00007ffff7601061 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#14 0x00007ffff7bd1dea in ?? () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
#15 0x00007ffff7bd1f40 in ?? () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
#16 0x00007ffff7bd24d9 in ?? () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
#17 0x00007ffff7bd2d0b in clGetPlatformIDs () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
#18 0x0000000000401328 in ?? ()
#19 0x00007ffff7824a40 in __libc_start_main (main=0x401170, argc=1, argv=0x7fffffffdec8, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffdeb8) at libc-start.c:289
#20 0x00000000004015a9 in ?? ()
(gdb) quit
A debugging session is active.

        Inferior 1 [process 3678] will be killed.

----
groo@hydra:~/devel/opencl/tools-master$ gdb --args ./cl-demo 1000000 10
Reading symbols from ./cl-demo...(no debugging symbols found)...done.
(gdb) run
Starting program: /home/groo/devel/opencl/tools-master/cl-demo 1000000 10
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7fffec786700 (LWP 3934)]

Program received signal SIGSEGV, Segmentation fault.
0x00007ffff07ba6c1 in ?? () from /usr/lib/x86_64-linux-gnu/libMesaOpenCL.so.1
(gdb) bt
#0  0x00007ffff07ba6c1 in ?? () from /usr/lib/x86_64-linux-gnu/libMesaOpenCL.so.1
#1  0x00007ffff07e3610 in ?? () from /usr/lib/x86_64-linux-gnu/libMesaOpenCL.so.1
#2  0x00007ffff07ef864 in ?? () from /usr/lib/x86_64-linux-gnu/libMesaOpenCL.so.1
#3  0x00007ffff07ba276 in ?? () from /usr/lib/x86_64-linux-gnu/libMesaOpenCL.so.1
#4  0x00007ffff7de95ba in call_init (l=<optimized out>, argc=argc@entry=3, argv=argv@entry=0x7fffffffde58, env=env@entry=0x7fffffffde78) at dl-init.c:72
#5  0x00007ffff7de96cb in call_init (env=<optimized out>, argv=<optimized out>, argc=<optimized out>, l=<optimized out>) at dl-init.c:30
#6  _dl_init (main_map=main_map@entry=0x675760, argc=3, argv=0x7fffffffde58, env=0x7fffffffde78) at dl-init.c:120
#7  0x00007ffff7dee587 in dl_open_worker (a=a@entry=0x7fffffffc798) at dl-open.c:579
#8  0x00007ffff7de9464 in _dl_catch_error (objname=objname@entry=0x7fffffffc788, errstring=errstring@entry=0x7fffffffc790, mallocedp=mallocedp@entry=0x7fffffffc787, operate=operate@entry=0x7ffff7dee0a0 <dl_open_worker>, 
    args=args@entry=0x7fffffffc798) at dl-error.c:187
#9  0x00007ffff7ded9a3 in _dl_open (file=0x675740 "libMesaOpenCL.so.1", mode=-2147483647, caller_dlopen=0x7ffff79c9dea, nsid=-2, argc=<optimized out>, argv=<optimized out>, env=0x7fffffffde78) at dl-open.c:663
#10 0x00007ffff71dafc9 in dlopen_doit (a=a@entry=0x7fffffffc9b0) at dlopen.c:66
#11 0x00007ffff7de9464 in _dl_catch_error (objname=0x60f0e0, errstring=0x60f0e8, mallocedp=0x60f0d8, operate=0x7ffff71daf70 <dlopen_doit>, args=0x7fffffffc9b0) at dl-error.c:187
#12 0x00007ffff71db62d in _dlerror_run (operate=operate@entry=0x7ffff71daf70 <dlopen_doit>, args=args@entry=0x7fffffffc9b0) at dlerror.c:163
#13 0x00007ffff71db061 in __dlopen (file=<optimized out>, mode=<optimized out>) at dlopen.c:87
#14 0x00007ffff79c9dea in ?? () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
#15 0x00007ffff79c9f40 in ?? () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
#16 0x00007ffff79ca4d9 in ?? () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
#17 0x00007ffff79cad0b in clGetPlatformIDs () from /usr/lib/x86_64-linux-gnu/libOpenCL.so.1
#18 0x0000000000402a3f in create_context_on ()
#19 0x00000000004016bd in main ()
Comment 1 Emil Velikov 2015-11-24 11:04:55 UTC
Without debug symbols, the backtrace isn't that useful. Before you install the extra dbg packages or anything else try this patch http://patchwork.freedesktop.org/patch/65914/
Comment 2 Dieter Nützel 2015-11-24 13:07:08 UTC
(In reply to Emil Velikov from comment #1)
> Without debug symbols, the backtrace isn't that useful. Before you install
> the extra dbg packages or anything else try this patch
> http://patchwork.freedesktop.org/patch/65914/

As it is
Tested-by: Tom Stellard <thomas.stellard@amd.com>
and
I tried it, too. Soo...
Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>

It should go in?
Comment 3 Emil Velikov 2015-11-24 18:52:14 UTC
(In reply to Dieter Nützel from comment #2)
> (In reply to Emil Velikov from comment #1)
> > Without debug symbols, the backtrace isn't that useful. Before you install
> > the extra dbg packages or anything else try this patch
> > http://patchwork.freedesktop.org/patch/65914/
> 
> As it is
> Tested-by: Tom Stellard <thomas.stellard@amd.com>
> and
> I tried it, too. Soo...
> Tested-by: Dieter Nützel <Dieter@nuetzel-hh.de>
> 
> It should go in?
Indeed it should. Tbh I'm not too excited about pushing patches that no one has skimmed through (just like the large series that caused this regression).

Either way I'll push it within an hour or two, barring any objection.
Comment 4 Aaron Watry 2015-11-24 22:32:41 UTC
Hmm. I'm still getting a segfault with the proposed patch.

Valgrind and gdb output in a second.
Comment 5 Aaron Watry 2015-11-24 22:33:23 UTC
Created attachment 120097
output of "valgrind clinfo"
Comment 6 Aaron Watry 2015-11-24 22:34:07 UTC
Created attachment 120098
gdb output and backtrace for clinfo
Comment 7 Aaron Watry 2015-11-25 16:49:52 UTC
Bah, ignore me.

I could still reproduce the issue yesterday to the best of my knowledge, but after an llvm/mesa rebuild with the patch applied this morning, things are working correctly...
Comment 8 Samuel Pitoiset 2015-11-26 08:49:28 UTC
This segfault still happens for me using mesa git, commit ca976e6900dc8ff457ed9dba661d037c616abc59.

OpenGL renderer string: Gallium 0.4 on NVE7
OpenGL core profile version string: 4.1 (Core Profile) Mesa 11.2.0-devel (git-ca976e6)

#0  pipe_loader_create_screen (dev=0x0) at pipe_loader.c:79
#1  0x00007ffff6177b0a in clover::device::device (this=0x641cd0, platform=..., ldev=<optimized out>) at core/device.cpp:44
#2  0x00007ffff6183226 in clover::create<clover::device, clover::platform&, pipe_loader_device*&> () at ./util/pointer.hpp:230
#3  clover::platform::platform (this=0x7ffff7dc5520 <(anonymous namespace)::_clover_platform>) at core/platform.cpp:35
#4  0x00007ffff612e196 in __static_initialization_and_destruction_0 (__initialize_p=1, __priority=65535) at api/platform.cpp:29
#5  _GLOBAL__sub_I_platform.cpp(void) () at api/platform.cpp:120
#6  0x00007ffff7dea27a in call_init.part () from /lib64/ld-linux-x86-64.so.2
#7  0x00007ffff7dea38b in _dl_init () from /lib64/ld-linux-x86-64.so.2
#8  0x00007ffff7ddbdba in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
#9  0x0000000000000001 in ?? ()
#10 0x00007fffffffe356 in ?? ()
#11 0x0000000000000000 in ?? ()
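
Frame #0 above shows pipe_loader_create_screen() being entered with dev=0x0 from the clover::device constructor, which itself runs while the platform object is being statically initialized at library load time. A minimal C++ sketch of that failure mode with a defensive null check follows. All names here (fake_loader_device, fake_screen, create_screen_checked, build_screens) are hypothetical stand-ins for illustration, not Mesa's real types or functions, and this is not the patch that was pushed.

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical stand-ins for illustration only; NOT Mesa's real types.
struct fake_loader_device { const char *driver_name; };
struct fake_screen { const char *name; };

// A screen factory that reads through `dev`. Unlike the crashing frame #0,
// it rejects a null pointer instead of dereferencing it.
fake_screen create_screen_checked(const fake_loader_device *dev) {
    if (!dev)
        throw std::invalid_argument("null loader device");
    return fake_screen{dev->driver_name};
}

// Build one screen per probed device, skipping entries the probe left null.
std::vector<fake_screen>
build_screens(const std::vector<const fake_loader_device *> &ldevs) {
    std::vector<fake_screen> screens;
    for (const fake_loader_device *ldev : ldevs) {
        if (!ldev)
            continue;  // entry was never filled in; do not touch it
        screens.push_back(create_screen_checked(ldev));
    }
    return screens;
}
```

With a list such as {&dev, nullptr}, build_screens() yields a single screen rather than crashing on the second entry. Whether skipping or failing loudly is the right policy is a design choice; the sketch only shows where the check would sit.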
Comment 9 Emil Velikov 2015-11-26 13:02:38 UTC
(In reply to Samuel Pitoiset from comment #8)
> This segfault still happens for me using mesa git, commit
> ca976e6900dc8ff457ed9dba661d037c616abc59.
> 
> OpenGL renderer string: Gallium 0.4 on NVE7
> OpenGL core profile version string: 4.1 (Core Profile) Mesa 11.2.0-devel
> (git-ca976e6)
> 
> #0  pipe_loader_create_screen (dev=0x0) at pipe_loader.c:79

Strange... I wonder if we're getting hit by the issue pointed out to Tom, namely that we should cap the ldevs iteration in platform::platform() to the value returned by the second call to pipe_loader_probe(). The reason is that the second call may return a smaller count than the first (a device has gone missing, ENOMEM, or something else).

Alternatively, can someone pinpoint the commit that causes this? (Note that you might need to cherry-pick the patch in comment 1 to fix up commit ff9cd8a67ca.)

Thanks
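
The capping idea described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: stub_device, probe_fn and probe_devices are hypothetical names, and the two-call probe shape (first call returns a count, second call fills the array and returns how many entries it wrote) is assumed from the description above rather than copied from Mesa's sources.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Hypothetical stand-in for a loader device; NOT Mesa's real type.
struct stub_device { int id; };

// Assumed probe shape: called with (nullptr, 0) it returns the device count;
// called with a buffer it fills up to ndev entries and returns how many
// devices it actually found on that call.
using probe_fn = std::function<int(stub_device **devs, int ndev)>;

std::vector<stub_device *> probe_devices(const probe_fn &probe) {
    const int n = probe(nullptr, 0);          // first call: count only
    if (n <= 0)
        return {};

    std::vector<stub_device *> ldevs(n, nullptr);
    const int filled = probe(ldevs.data(), n); // second call: fill buffer

    // Cap the list to what the second call reported, so later iteration
    // never visits slots that were left null.
    ldevs.resize(std::max(0, std::min(filled, n)));
    return ldevs;
}
```

If the first call reports 2 devices and the second only fills 1, the returned vector has exactly one element, so a loop over it cannot hand a null pointer to the screen-creation code.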
Comment 10 Paulo Dias 2015-11-26 14:30:34 UTC
Created attachment 120146
attachment-27601-0.html

Well, FWIW it fixed it for me: no more crashes, and clinfo and the tests work.
Might be hw specific or a different bug.
Comment 11 Emil Velikov 2015-11-26 14:57:21 UTC
As per comment 10, I'm closing this bug.

Samuel, can you please open another bug (or catch me on IRC) with the requested information? If you have some changes on top of master, please point me to a branch where I can skim through them. Thanks
