Bug 105906 - [DRI3] Compiz segfaults in intel_destroy_image()
Summary: [DRI3] Compiz segfaults in intel_destroy_image()
Status: VERIFIED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Drivers/DRI/i965
Version: git
Hardware: Other All
Importance: high critical
Assignee: Intel 3D Bugs Mailing List
QA Contact: Intel 3D Bugs Mailing List
URL:
Whiteboard:
Keywords: bisected, patch, regression
Depends on:
Blocks: 106157
Reported: 2018-04-05 15:50 UTC by Eero Tamminen
Modified: 2018-05-14 10:03 UTC

See Also:
i915 platform:
i915 features:


Attachments
Gdb backtrace of the crash (6.19 KB, text/plain)
2018-04-05 15:50 UTC, Eero Tamminen
Description Eero Tamminen 2018-04-05 15:50:57 UTC
Created attachment 138623
Gdb backtrace of the crash

Somewhere between the following Mesa commits:
1e9d779331: 2018-03-08 18:14:02 UTC: meson: Fix building gallium media libs without egl
a2f08dd574: 2018-03-12 17:24:31 UTC: gallium: Use struct gl_array_attributes* as st_pipe_vertex_format argument.

Ubuntu 16.04 Unity Compiz started randomly crashing with a NULL pointer access during our test runs. Normally the Unity desktop is able to restart Compiz successfully, so it can crash again.

During ~3 hour test runs it segfaults a few times, as can be seen from dmesg:
[ 8002.554441] compiz[5936]: segfault at 8 ip 00007fe34f8bcc34 sp 00007ffe0e44a810 error 4 in i965_dri.so[7fe34f4ac000+84e000]
[ 8046.153748] compiz[7073]: segfault at 8 ip 00007f218d4f7c34 sp 00007ffe8e5973f0 error 4 in i965_dri.so[7f218d0e7000+84e000]
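
The offset of the faulting instruction inside i965_dri.so can be derived from these lines by subtracting the mapping base (the first number in the square brackets) from the ip value. The following is a minimal sketch, assuming the usual x86 kernel layout "segfault at ADDR ip IP ... in LIB[BASE+SIZE]"; the names SEGFAULT_RE and parse_segfault are illustrative only:

```python
import re

# Assumed layout of the kernel's x86 segfault message, based on the two
# dmesg lines quoted in this report.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) .* in "
    r"(?P<lib>[^\[\s]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def parse_segfault(line):
    """Return library name, fault address and ip offset, or None if no match."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        return None
    ip = int(match.group("ip"), 16)
    base = int(match.group("base"), 16)
    return {
        "lib": match.group("lib"),
        "fault_addr": int(match.group("addr"), 16),
        "offset": ip - base,  # position of the instruction inside the mapping
    }


lines = [
    "[ 8002.554441] compiz[5936]: segfault at 8 ip 00007fe34f8bcc34 "
    "sp 00007ffe0e44a810 error 4 in i965_dri.so[7fe34f4ac000+84e000]",
    "[ 8046.153748] compiz[7073]: segfault at 8 ip 00007f218d4f7c34 "
    "sp 00007ffe8e5973f0 error 4 in i965_dri.so[7f218d0e7000+84e000]",
]

for line in lines:
    info = parse_segfault(line)
    print(info["lib"], hex(info["fault_addr"]), hex(info["offset"]))
# i965_dri.so 0x8 0x410c34
# i965_dri.so 0x8 0x410c34
```

Both crashes land on the same offset (0x410c34) even though the library is mapped at different base addresses, and the fault address 8 is what one would expect when a field at a small offset is read through a NULL pointer. With a matching unstripped build, the offset could be passed to a tool such as addr2line to find the source line.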

I've seen these crashes on all platforms we have.

I was able to catch the crash twice in Gdb during a 3 hour test run. Both times it was due to intel_destroy_image() getting a NULL pointer:
#0  intel_destroy_image (image=0x0)
#1  dri3_free_render_buffer ()
#2  dri3_get_buffer ()
#3  loader_dri3_get_buffers ()
#4  intel_update_image_buffers ()
#5  intel_update_renderbuffers ()
#6  intel_prepare_render ()
#7  brw_prepare_drawing ()
#8  brw_draw_prims ()
#9  vbo_draw_arrays ()
...
#22 CompositeScreen::handlePaintTimeout()

See attached full backtrace for details.

As this happens randomly, i.e. seems to be timing related, my guess is that it happens when an application either starts or exits while the compositor happens to be doing a screen update.

(Unfortunately I don't have data from between those Mesa dates. Because the issue takes a long time to reproduce and is random, it's not bisection friendly.)

---

In the dmesg outputs, the crash always happens on the same VMA page in Mesa, on all platforms. The crash instruction pointer takes a couple of different values inside that (4K?) page, so it's possible that the above backtrace isn't the only one.

The crash happens both in a setup using slightly older kernel & X builds and in one using the latest git versions of those, i.e. it's due to a Mesa change, not a change in other components. (In Ubuntu itself, the only update in this time frame was to libgcrypto20, to disable FIPS if it was enabled.)
Comment 1 Lionel Landwerlin 2018-04-05 16:06:26 UTC
Looks like a DRI3 issue. Cc Louis-Francis & Daniel who've worked on this recently.
Comment 2 vadym 2018-04-06 08:11:57 UTC
I'm experiencing the same issue when switching between windows. It was first noticed with GFXBench: when doing ALT+TAB between GFXBench and Firefox, a similar crash occurs. The issue looks very similar to https://bugs.freedesktop.org/show_bug.cgi?id=104392 and https://bugs.freedesktop.org/show_bug.cgi?id=104301. I checked those bugs and they reproduce again, so they should probably be reopened. I suppose the bug with Dota, https://bugs.freedesktop.org/show_bug.cgi?id=104214, will also appear again.

Found bad commit:

commit 3160cb86aa9234ff78e11fe7a00f30bfb5cb8445
Author: Louis-Francis Ratté-Boulianne <lfrb@collabora.com>
Date:   Fri Oct 6 01:26:51 2017 -0400

    egl/x11: Re-allocate buffers if format is suboptimal
    
    If PresentCompleteNotify event says the pixmap was presented
    with mode PresentCompleteModeSuboptimalCopy, it means the pixmap
    could possibly have been flipped instead if allocated with a
    different format/modifier.
    
    Signed-off-by: Louis-Francis Ratté-Boulianne <lfrb@collabora.com>
    Reviewed-by: Daniel Stone <daniels@collabora.com>
Comment 3 Sergii Romantsov 2018-04-06 08:16:08 UTC
Proposed patch: https://lists.freedesktop.org/archives/mesa-dev/2018-April/191363.html
Comment 4 Andriy Khulap 2018-04-06 09:12:41 UTC
I can add that Bug 104301 is back and can be reproduced with Unity desktop on Ubuntu 16.04. But can't be reproduced on the same Ubuntu with xfce-desktop and Debian buster with Xfce.
Comment 5 Eero Tamminen 2018-04-06 10:06:53 UTC
(In reply to vadym from comment #2)
> Found bad commit:

Thanks for the bisect!


> commit 3160cb86aa9234ff78e11fe7a00f30bfb5cb8445
> Author: Louis-Francis Ratté-Boulianne <lfrb@collabora.com>
> Date:   Fri Oct 6 01:26:51 2017 -0400
> 
>     egl/x11: Re-allocate buffers if format is suboptimal

I'd recommend using the --format=fuller option to get the correct upstreaming date:
-----------------------------------------------
commit 3160cb86aa9234ff78e11fe7a00f30bfb5cb8445
Author:     Louis-Francis Ratté-Boulianne <lfrb@collabora.com>
AuthorDate: Fri Oct 6 01:26:51 2017 -0400
Commit:     Daniel Stone <daniels@collabora.com>
CommitDate: Fri Mar 9 17:47:14 2018 +0000

    egl/x11: Re-allocate buffers if format is suboptimal
-----------------------------------------------
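
The difference between the two dates matters for the regression window given in the description. As a rough sketch (the helper names parse_git_date and in_window are illustrative, and the window bounds are the timestamps of commits 1e9d779331 and a2f08dd574 as listed in the description), the dates can be compared like this:

```python
from datetime import datetime, timezone

# Layout of the dates in the git output above. %a and %b are locale
# dependent; this assumes an English/C locale.
GIT_DATE_FMT = "%a %b %d %H:%M:%S %Y %z"


def parse_git_date(text):
    """Parse a date such as 'Fri Mar 9 17:47:14 2018 +0000'."""
    return datetime.strptime(text, GIT_DATE_FMT)


author_date = parse_git_date("Fri Oct 6 01:26:51 2017 -0400")
commit_date = parse_git_date("Fri Mar 9 17:47:14 2018 +0000")

# Regression window from the bug description (both timestamps in UTC).
window_start = datetime(2018, 3, 8, 18, 14, 2, tzinfo=timezone.utc)
window_end = datetime(2018, 3, 12, 17, 24, 31, tzinfo=timezone.utc)


def in_window(when):
    """True if the timestamp falls inside the reported regression window."""
    return window_start <= when <= window_end


print(in_window(author_date))             # False
print(in_window(commit_date))             # True
print((commit_date - author_date).days)   # 154
```

Only the CommitDate falls inside the window in which the crashes started; the AuthorDate is about five months earlier.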


(In reply to Sergii Romantsov from comment #3)
> Proposed patch:
> https://lists.freedesktop.org/archives/mesa-dev/2018-April/191363.html

Thanks, I just started a 3h test run to validate whether this fixes the issue completely and whether there's any performance impact.


(In reply to Andriy Khulap from comment #4)
> I can add that Bug 104301 is back and can be reproduced with Unity desktop
> on Ubuntu 16.04. But can't be reproduced on the same Ubuntu with
> xfce-desktop and Debian buster with Xfce.

AFAIK XFCE uses XRender for compositing, not GL/ES, so it works quite differently compared to most other compositors.
Comment 6 Eero Tamminen 2018-04-06 13:34:19 UTC
I've verified that the patch fixes all the Compiz crashes and doesn't regress anything in our test-set.
Comment 7 Mark Janes 2018-04-06 15:13:23 UTC
Since this bug regresses in the same way as 104301 and 104214, is it time to make an automated test that will detect these types of errors?  Is that even possible?

The reference counting mechanism is clearly fragile.
Comment 8 Daniel Stone 2018-04-06 15:14:29 UTC
(In reply to Mark Janes from comment #7)
> Since this bug regresses in the same way as 104301 and 104214, is it time to
> make an automated test that will detect these types of errors?  Is that even
> possible?

Depends on what the environment is, I suppose: we'd need to start X with DRI3 support in a controlled manner, and AFAIK that involves complete TTY access.
Comment 9 Eero Tamminen 2018-04-17 09:15:46 UTC
At worst, the compositor seems to crash about 10 times / hour because of this.

The compositor going away and being restarted also causes other programs to fail with X errors if they happen to start at the same time.
Comment 10 Eero Tamminen 2018-04-20 15:27:34 UTC
Btw. on a newer Ubuntu release (17.10), this is a desktop killer bug. The desktop dies along with Compiz; it doesn't get restarted like on 16.04.

Even on 16.04, the desktop sometimes fails when Compiz goes down, although it's rare.

Sergii had a fix available already 2 weeks ago, why is it not yet committed?
Comment 11 Eero Tamminen 2018-05-02 11:20:30 UTC
Latest patch fixing the issue:
https://patchwork.freedesktop.org/patch/219239/ (comments)
https://patchwork.freedesktop.org/patch/219923/

Compiz crashing will also sometimes cause other programs to crash (when they start, I assume) due to a failing XGetProperty call.
Comment 12 Michel Dänzer 2018-05-09 14:05:35 UTC
Fixed in Git master:

Commit: 6f81e07ecb8c0793dc482307d5d96fd3df95b7d2
URL:    http://cgit.freedesktop.org/mesa/mesa/commit/?id=6f81e07ecb8c0793dc482307d5d96fd3df95b7d2

Author: Michel Dänzer <michel.daenzer@amd.com>
Date:   Fri Apr 27 17:41:48 2018 +0200

dri3: Only update number of back buffers in loader_dri3_get_buffers
Comment 13 Eero Tamminen 2018-05-14 10:03:44 UTC
Verified, the crashes are gone.