Bug 98169 - lm_sensors hud option crashes unigine heaven
Summary: lm_sensors hud option crashes unigine heaven
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Mesa core (show other bugs)
Version: git
Hardware: Other All
: medium normal
Assignee: Steven Toth
QA Contact: mesa-dev
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-10-09 12:10 UTC by Christoph Haag
Modified: 2016-12-25 00:11 UTC (History)
0 users

See Also:
i915 platform:
i915 features:


Attachments
valgrind output (8.90 KB, text/plain)
2016-10-09 12:10 UTC, Christoph Haag
Details
Don't destroy sensors context when its still inuse (640 bytes, patch)
2016-10-11 15:50 UTC, Steven Toth
Details | Splinter Review
gallium hud showing 0 for cpu freq and sensors (2.23 MB, image/png)
2016-10-11 16:21 UTC, Christoph Haag
Details
Bugfix: we're prematurely freeing a HUB object (2.59 KB, patch)
2016-10-11 17:39 UTC, Steven Toth
Details | Splinter Review
bt full (5.50 KB, text/plain)
2016-10-11 17:51 UTC, Christoph Haag
Details
bt full with sensors (3.01 KB, text/plain)
2016-10-11 18:16 UTC, Christoph Haag
Details
Betetr version of the sensor_cleanup patch (1.27 KB, patch)
2016-10-12 14:50 UTC, Steven Toth
Details | Splinter Review
success (2.88 MB, image/png)
2016-10-12 15:49 UTC, Christoph Haag
Details
Switch to internal reference counting and mutexing. (13.95 KB, patch)
2016-10-13 14:40 UTC, Steven Toth
Details | Splinter Review

Description Christoph Haag 2016-10-09 12:10:43 UTC
Created attachment 127151 [details]
valgrind output

For convenient debugging, go to unigine-heaven/bin and run

I'm using GALLIUM_HUD=sensors_temp_cu-amdgpu-pci-0100.temp1 for this.
glxgears works fine with it.
ungine-heaven crashes (I have not tested more applications).

For convenient debugging, go to unigine-heaven/bin and run

./heaven_x64 -project_name Heaven -data_path ../ -engine_config ../data/heaven_4.0.cfg -system_script heaven/unigine.cpp -sound_app openal -video_app opengl -video_multisample 1 -video_fullscreen 0 -video_mode 6 -extern_define ,RELEASE,LANGUAGE_EN,QUALITY_HIGH,TESSELLATION_NORMAL -extern_plugin ,GPUMonitor

after printing
Set 1920x1080 windowed video mode
it crashes
*** Error in `./heaven_x64': corrupted double-linked list: 0x0000000001d7b040 ***

valgrind detects an error here:

==13792== Invalid free() / delete / delete[] / realloc()
==13792==    at 0x4C2AD90: free (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==13792==    by 0x2A0C4512: sensors_cleanup (in /usr/lib/libsensors.so.4.4.0)
==13792==    by 0x294B0606: free_query_data (hud_sensors_temp.c:200)
==13792==    by 0x294AB4D5: hud_graph_destroy (hud_context.c:863)
==13792==    by 0x294AD145: hud_destroy (hud_context.c:1528)
==13792==    by 0x29452BA0: dri_destroy_context (dri_context.c:176)
==13792==    by 0x29451BCC: driDestroyContext (dri_util.c:500)
==13792==    by 0x1B4FA333: dri3_destroy_context (dri3_glx.c:200)

More complete valgrind output attached.
Comment 1 Christoph Haag 2016-10-09 14:20:58 UTC
I also noticed that the cpufrequency graphs (cpufreq-cur-cpu0) always show 0 in unigine heaven.

They work correctly in glxgears, so perhaps it's caused by the same thing unigine heaven does.
Comment 2 Steven Toth 2016-10-10 15:18:25 UTC
Tests OK on glxgears, glxdemo and glxheads, on my platform (x86_64 Ubuntu 16.04), with NVIDIA GTX 950.

I'll look at installing unigine.
Comment 3 Steven Toth 2016-10-10 15:57:44 UTC
I'm not familiar with Unigine and I'm having issues trying to get it to run with mesa tip. Your feedback is welcome.

Testing with a Heaven-Unigine 4.0 binary downloaded today.

Compiling/installing mesa tip to a non-system specific dir, with --prefix, eg: /home/stoth/mesa-tip/root

./configure --with-gallium-drivers=nouveau --prefix=$PWD/root \
	--enable-lmsensors=yes --enable-gallium-extra-hud=yes

I include $PWD/root/lib dir in my LD_LIBRARY_PATH before running any application binaries, including glxgears etc. All works as expected with glxhears/heads/demo, no segfaults, crashes or other unwanted behavior.

Running heaven with your provided command line, I see an error .

Loading "/home/stoth/.Heaven/heaven_4.0.cfg"...
Loading "libGPUMonitor_x86.so"...
Loading "libGL.so.1"...
Loading "libopenal.so.1"...
Set 1280x720 windowed video mode
X Error of failed request:  GLXBadFBConfig
  Major opcode of failed request:  155 (GLX)
  Minor opcode of failed request:  34 ()
  Serial number of failed request:  58
  Current serial number in output stream:  57
AL lib: (EE) alc_cleanup: 1 device not closed

I see the same error without any command line options, suggesting a general GL initialization error.

Suggestions welcome.

I misspoke earlier, I'm on Ubuntu 16.04 32bit.
Comment 4 Christoph Haag 2016-10-10 16:08:53 UTC
Only thing I can think of is that you may need LIBGL_DRIVERS_PATH.

Try
LIBGL_DRIVERS_PATH=$PWD/root/lib/xorg/modules/dri/
or something like that.

also LIBGL_DEBUG=verbose to check whether something goes wrong while loading the driver.
Comment 5 Steven Toth 2016-10-11 11:11:52 UTC
Grr. I still can't make it run against mesa tip.

+ ./heaven -project_name Heaven -data_path ../ -engine_config ../data/heaven_4.0.cfg -system_script heaven/unigine.cpp -sound_app openal -video_app opengl -video_multisample 1 -video_fullscreen 0 -video_mode 6 -extern_define ,RELEASE,LANGUAGE_EN,QUALITY_HIGH,TESSELLATION_NORMAL -extern_plugin ,GPUMonitor
+ export LD_LIBRARY_PATH=/home/stoth/ZODIAC-16.04/mesa-tip/root/lib/:home/stoth/ZODIAC-16.04/mesa-tip/root/lib/dri
+ LD_LIBRARY_PATH=/home/stoth/ZODIAC-16.04/mesa-tip/root/lib/:home/stoth/ZODIAC-16.04/mesa-tip/root/lib/dri
+ export LD_LIBRARY_PATH=./x86:/home/stoth/ZODIAC-16.04/mesa-tip/root/lib/:home/stoth/ZODIAC-16.04/mesa-tip/root/lib/dri
+ LD_LIBRARY_PATH=./x86:/home/stoth/ZODIAC-16.04/mesa-tip/root/lib/:home/stoth/ZODIAC-16.04/mesa-tip/root/lib/dri
+ ./browser_x86 -config ../data/launcher/launcher.xml
libGL: Can't open configuration file /home/stoth/.drirc: No such file or directory.
libGL: pci id for fd 12: 10de:1402, driver nouveau
libGL: OpenDriver: trying /home/stoth/ZODIAC-16.04/mesa-tip/root/lib/dri/nouveau_dri.so
libGL: Can't open configuration file /home/stoth/.drirc: No such file or directory.
libGL: Can't open configuration file /home/stoth/.drirc: No such file or directory.
libGL: Using DRI3 for screen 0
X Error of failed request:  GLXBadFBConfig
  Major opcode of failed request:  155 (GLX)
  Minor opcode of failed request:  34 ()
  Serial number of failed request:  58
  Current serial number in output stream:  57
Loading "/home/stoth/.Heaven/heaven_4.0.cfg"...
Loading "libGPUMonitor_x86.so"...
Loading "libGL.so.1"...
Loading "libopenal.so.1"...
Set 1280x720 windowed video mode
AL lib: (EE) alc_cleanup: 1 device not closed

I also looked at library loading generally with LD_DEBUG=libs and saw no obvious failures to load

Also, copying /etc/drirc into $HOME/.drrc had no effect.
Comment 6 Christoph Haag 2016-10-11 11:34:50 UTC
Might be a nouveau issue on Maxwell, no idea.

A google search says that this means that the driver does not advertise GLX_ARB_create_context  as supported: https://www.opengl.org/registry/specs/ARB/glx_create_context.txt
and that you need OpenGL 3.2 for this extension. Does glxinfo | grep OpenGL print something with OpenGL core profile 3.2 or newer?

Since I can reproduce the X error with MESA_GL_VERSION_OVERRIDE=3.1, maybe try setting
MESA_GL_VERSION_OVERRIDE=3.2 MESA_GLSL_VERSION_OVERRIDE=150
and see what happens...
Comment 7 Eero Tamminen 2016-10-11 11:39:34 UTC
Does the crashing happen if you disable Heaven tessellation (TESSELLATION_DISABLED option)?

One of the causes for FBConfig issues is X server using mismatching GL driver than the application you're testing.  Did you try system install of Mesa?
Comment 8 Steven Toth 2016-10-11 12:18:04 UTC
+ glxinfo
+ grep -i opengl
libGL: pci id for fd 4: 10de:1402, driver nouveau
libGL: OpenDriver: trying /home/stoth/<redacted>/mesa-tip/root/lib/dri/nouveau_dri.so
libGL: Using DRI3 for screen 0
OpenGL vendor string: nouveau
OpenGL renderer string: Gallium 0.4 on NV126
OpenGL version string: 2.1 Mesa 12.1.0-devel (git-fc8b358)
OpenGL shading language version string: 1.30
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 2.0 Mesa 12.1.0-devel (git-fc8b358)
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 1.0.16
OpenGL ES profile extensions:
+ exit 0
Comment 9 Steven Toth 2016-10-11 12:21:32 UTC
Extended renderer info (GLX_MESA_query_renderer):
    Vendor: nouveau (0x10de)
    Device: NV126 (0x1402)
    Version: 12.1.0
    Accelerated: yes
    Video memory: 2031MB
    Unified memory: no
    Preferred profile: compat (0x2)
    Max core profile version: 0.0
    Max compat profile version: 2.1
    Max GLES1 profile version: 1.1
    Max GLES[23] profile version: 2.0

^^^ Might be totally unrelated, core profile reporting as 0.0
Comment 10 Ilia Mirkin 2016-10-11 12:36:44 UTC
(In reply to Steven Toth from comment #8)
> + glxinfo
> + grep -i opengl
> libGL: pci id for fd 4: 10de:1402, driver nouveau
> libGL: OpenDriver: trying
> /home/stoth/<redacted>/mesa-tip/root/lib/dri/nouveau_dri.so
> libGL: Using DRI3 for screen 0
> OpenGL vendor string: nouveau
> OpenGL renderer string: Gallium 0.4 on NV126
> OpenGL version string: 2.1 Mesa 12.1.0-devel (git-fc8b358)

I'm sure Heaven wants at least GL 3.0. You probably want to build mesa with --enable-texture-float
Comment 11 Steven Toth 2016-10-11 15:23:25 UTC
enable-texture-float=yes

^^^ This fixes the runtime issue, thanks.

I can now repro the original sensors related issue. Investigating.
Comment 12 Steven Toth 2016-10-11 15:50:06 UTC
Created attachment 127217 [details] [review]
Don't destroy sensors context when its still inuse
Comment 13 Steven Toth 2016-10-11 15:53:55 UTC
Christoph,

I've been able to reproduce the problem. Your use case has been exercising a code path that glxgears/heads/demo does not, with regards to HUD free/destruction. glxgears termination never routes through the free_query_data() function.

As a result we've been destroying the sensors library context way too early.

Can you please confirm the attached patch resolves the origin crash you're seeing?

Thanks.
Comment 14 Christoph Haag 2016-10-11 16:21:48 UTC
Created attachment 127218 [details]
gallium hud showing 0 for cpu freq and sensors

Thanks for the patch. I can confirm that unigine-heaven starts now with the sensors option enabled.


I don't think there's another bug report needed for this as it's probably a similar issue:
Attached is a screenshot that shows the hud with cpu frequency and the lm_sensors temperature being displayed as 0. This does not happen with glxgears.
Comment 15 Alex Deucher 2016-10-11 16:34:58 UTC
(In reply to Christoph Haag from comment #14)
> I don't think there's another bug report needed for this as it's probably a
> similar issue:
> Attached is a screenshot that shows the hud with cpu frequency and the
> lm_sensors temperature being displayed as 0. This does not happen with
> glxgears.

Make sure your 32-bit 3d driver is up to date if the app is 32 bit and your distro is 64-bit.
Comment 16 Steven Toth 2016-10-11 16:37:58 UTC
That's pretty bizarre, I don't see any of the cpu frequency, disk io or sensor charts with any data.

This isn't the same bug as the sensor bug, it's something new.

Can you please share your exact ./configure options for mesa? I still have issues running unigine, it starts, I see the render window appear and the app segfaults (unrelated to sensors) so my mesa build still isn't configured correctly - or well enough for unigine to operate.

thx
Comment 17 Andy Furniss 2016-10-11 16:46:42 UTC
(In reply to Steven Toth from comment #16)

> Can you please share your exact ./configure options for mesa? I still have
> issues running unigine, it starts, I see the render window appear and the
> app segfaults (unrelated to sensors) so my mesa build still isn't configured
> correctly - or well enough for unigine to operate.
> 
> thx

--sysconfdir=/etc is important for unigine as there are workarounds that need to live in /etc/drirc also if you have .drirc in $HOME that may mess things up(untested).
Comment 18 Ilia Mirkin 2016-10-11 17:13:30 UTC
(In reply to Andy Furniss from comment #17)
> (In reply to Steven Toth from comment #16)
> 
> > Can you please share your exact ./configure options for mesa? I still have
> > issues running unigine, it starts, I see the render window appear and the
> > app segfaults (unrelated to sensors) so my mesa build still isn't configured
> > correctly - or well enough for unigine to operate.
> > 
> > thx
> 
> --sysconfdir=/etc is important for unigine as there are workarounds that
> need to live in /etc/drirc also if you have .drirc in $HOME that may mess
> things up(untested).

Actually it's important to actually "make install" and *not* use LIBGL_DRIVERS_PATH for precisely this reason.
Comment 19 Steven Toth 2016-10-11 17:39:11 UTC
Created attachment 127222 [details] [review]
Bugfix: we're prematurely freeing a HUB object

Allow multiple open uses of the HUD objects, don't assuming we can destroy an object after one reference free's it.
Comment 20 Steven Toth 2016-10-11 17:40:05 UTC
Christoph,

The attached referencecount.patch should work very nicely for you.

Can you please test and report back.

I have unigine workere here now, I was able to reproduce the issue and I belive this patch (in addition to the prio sensor patch) should resolve your open issue.
Comment 21 Christoph Haag 2016-10-11 17:51:34 UTC
Created attachment 127224 [details]
bt full

(In reply to Steven Toth from comment #20)
> Christoph,
> 
> The attached referencecount.patch should work very nicely for you.

Unfortunately, it introduces a new segmentation fault in libsensors:

#0  0x00007fffcecd31a4 in ?? () from /usr/lib/libsensors.so.4
#1  0x00007fffcecd3334 in ?? () from /usr/lib/libsensors.so.4
#2  0x00007fffcecd3dc0 in sensors_get_subfeature () from /usr/lib/libsensors.so.4
#3  0x00007fffcf6267ba in get_sensor_values (sti=0x85d9c0) at hud/hud_sensors_temp.c:115
#4  0x00007fffcf626a15 in query_sti_load (gr=0x8f4710) at hud/hud_sensors_temp.c:162
#5  0x00007fffcf620800 in hud_draw (hud=0x8f1e10, tex=0x879610) at hud/hud_context.c:577
#6  0x00007fffcf5ca572 in dri_flush (cPriv=0x85f780, dPriv=0x867550, flags=3, reason=__DRI2_THROTTLE_SWAPBUFFER) at dri_drawable.c:488
#7  0x00007fffe195cdbf in loader_dri3_flush (draw=0x866da8, flags=3, throttle_reason=__DRI2_THROTTLE_SWAPBUFFER) at loader_dri3_helper.c:458
#8  0x00007fffe1955166 in glx_dri3_flush_drawable (draw=0x866da8, flags=3) at dri3_glx.c:148
#9  0x00007fffe195d5a0 in loader_dri3_swap_buffers_msc (draw=0x866da8, target_msc=0, divisor=0, remainder=0, flush_flags=3, force_copy=false) at loader_dri3_helper.c:641
#10 0x00007fffe1955c7e in dri3_swap_buffers (pdraw=0x866d70, target_msc=0, divisor=0, remainder=0, flush=1) at dri3_glx.c:521
#11 0x00007fffe190744f in glXSwapBuffers (dpy=0x6ebb20, drawable=109051906) at glxcmds.c:843
Comment 22 Steven Toth 2016-10-11 18:05:07 UTC
That's odd, the reference patch had no sensor changes.

Can you back the reference count patch out and check that sensors is still working from the last patch?
Comment 23 Christoph Haag 2016-10-11 18:16:22 UTC
Created attachment 127225 [details]
bt full with sensors

I should have clarified: With the two patches it still works in glxgears. The new segfault only happens in unigine heaven for me. Here is a new backtrace including lm_sensors 3.4.0 debug symbols.
Comment 24 Steven Toth 2016-10-12 14:50:13 UTC
Created attachment 127246 [details] [review]
Betetr version of the sensor_cleanup patch
Comment 25 Steven Toth 2016-10-12 14:52:03 UTC
Christoph,

Please remove the prior sensor-cleanup.patch, apply the attached sensor-cleanupv2.patch.

This should prevent freeing of an object while its being used by a second hud element.

Let me know, thanks.
Comment 26 Christoph Haag 2016-10-12 15:49:52 UTC
Created attachment 127250 [details]
success

With the patches
https://bugs.freedesktop.org/show_bug.cgi?id=98169#c19
https://bugs.freedesktop.org/show_bug.cgi?id=98169#c25
it works now.

Thanks for the fix.
Comment 27 Steven Toth 2016-10-12 16:36:29 UTC
Thanks for reporting these issue. I'll wrap them up and get them on the ML for review.

Care to give me your Tested-By sign-off?
Comment 28 Christoph Haag 2016-10-12 16:40:20 UTC
Sure.
Comment 29 Steven Toth 2016-10-13 14:40:05 UTC
Created attachment 127271 [details] [review]
Switch to internal reference counting and mutexing.

Replacement, add mutex locking on the lists and using the internal reference counting implementation.
Comment 30 Steven Toth 2016-10-13 14:44:43 UTC
Christoph, based on feedback from the mailing list, I made some changes to use the MESA internal reference counting mechanisms which makes sensor and other cleanup a little more robust.

The patch tests fine for me with glxgears, but due to some hardware issues I no longer have access to a working unigine environment.

If you have a moment, can you please remove all of my earlier patches and apply the mutex-reference.patch and check that it's still working for with unigine?

Specifically, that you still have values in all graphs and that you no longer see the sensor_cleanup() crash?

Many thanks.
Comment 31 Christoph Haag 2016-10-13 15:42:20 UTC
Yes, your new patch works. No crashing, all panes are visible and show the correct values.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.