Bug 104214 - Dota crashes when switching from game to desktop
Summary: Dota crashes when switching from game to desktop
Status: RESOLVED FIXED
Alias: None
Product: Mesa
Classification: Unclassified
Component: Other (show other bugs)
Version: 17.3
Hardware: Other All
: medium normal
Assignee: Thomas Hellström
QA Contact: mesa-dev
URL:
Whiteboard:
Keywords:
: 104301 104342 104392 104443 104579 104583 (view as bug list)
Depends on:
Blocks:
 
Reported: 2017-12-11 16:35 UTC by Sven
Modified: 2018-01-20 20:20 UTC (History)
15 users (show)

See Also:
i915 platform:
i915 features:


Attachments
GDB output (3.05 KB, text/plain)
2018-01-01 18:26 UTC, Cyril
Details
Fail gracefully when make_surface returns NULL (1.11 KB, patch)
2018-01-02 23:23 UTC, Ian Romanick
Details | Splinter Review
gdb output from crash after alt+tab switching (2.62 KB, text/plain)
2018-01-05 22:14 UTC, Sven
Details
Patch to fix a potential problem with renderbuffer freeing (593 bytes, patch)
2018-01-10 09:45 UTC, Thomas Hellström
Details | Splinter Review
Updated patch to avoid freeing renderbuffers currently in use (1.43 KB, patch)
2018-01-10 10:33 UTC, Thomas Hellström
Details | Splinter Review

Description Sven 2017-12-11 16:35:06 UTC
I upgraded from mesa 17.2.6 to mesa 17.3. Since then, dota2 crashes when switching from the game to the desktop with alt+tab. It may crash on the first attempt, but usually I do not need more than 3 attempts to get the crash.

My laptop has an Intel graphics card (i7-7700HQ) and a downgrade to mesa 17.2.x fixes the issue completely.
Comment 1 Sven 2017-12-11 16:38:41 UTC
Steps to reproduce:
Start Dota, stay on the main menu screen, switch back and forth between the desktop and Dota. I'm using the mate desktop environment.

Result: (from dmesg)
[ 8804.741329] traps: GLRenderThread[18545] general protection ip:7f5136de2e81 sp:7f5135089960 error:0 in i965_dri.so[7f51369f8000+795000]
Comment 2 Eero Tamminen 2017-12-11 17:02:53 UTC
I cannot reproduce this with KBL (i7-7567U) running Ubuntu 16.04 LTS with latest Mesa git.

Can you reproduce the issue if you disable/change the "Enable Steam Overlay while in-game" option from the DOTA2 Steam properties?


Could you provide backtrace of the issue?

0. Install Gdb (in Debian/Ubuntu: "sudo apt install gdb")
1. Start DOTA2
2. Connect Gdb to it (from a terminal) with following command:
   sudo gdb /proc/$(pidof dota2)/self $(pidof dota2)
3. Enter "c" to Gdb, to continue
4. Reproduce the DOTA2 crash
5. Attach here the *full* output of "thread apply all bt" Gdb command

(Preferably you should do that after installing Mesa debug symbols package from the same repository where you installed Mesa from. What package name / what repository you need for that is distribution specific.)
Comment 3 Eero Tamminen 2017-12-11 17:08:52 UTC
(In reply to Eero Tamminen from comment #2)
>    sudo gdb /proc/$(pidof dota2)/self $(pidof dota2)

Sorry, should be:
     sudo gdb /proc/$(pidof dota2)/exe $(pidof dota2)
Comment 4 Sven 2017-12-11 17:43:14 UTC
Yes, I can reproduce the issue with steam overlay disabled.

I will have to try gdb some other day when I have some more time.
Comment 5 Sven 2017-12-24 13:39:18 UTC
The issue didn't occur at work, where i have an i5-4690. At work I have Arch Linux with mesa 17.3.0.

On my private laptop, I switched from Gentoo to Arch Linux yesterday (not because of this issue). Arch also comes with mesa 17.3.0 and the issue occurs too. In fact, it got worse in the sense that dota doesn't even start anymore (crashed during black screen with dota logo, main menu doesn't show). Again, like on Gentoo, downgrading from 17.3.0 to 17.2.6 fixed the issue and dota starts (yes, I only downgraded the mesa package, didn't even restart X11). My laptop has an i7-7700HQ.

I saw that Arch also has an 17.3.1 package. I will try that next. Then I will try to provide some gdb trace or so.
Comment 6 Sven 2017-12-29 23:03:42 UTC
When using mesa 17.3.1 on Arch Linux, Dota still crashed on startup. The error is

[ 3047.872388] GLRenderThread[12897]: segfault at 44 ip 00007fc93f36fce0 sp 00007fc93d1ec910 error 4 in i965_dri.so[7fc93ef5a000+7f8000]

The Arch Linux package version of mesa is 17.3.1-2. I'm investigating on how to do the "thread apply all bt". I cannot attach gdb to the PID of a running dota by hand, cause dota crashes on start right now.
Comment 7 Cyril 2018-01-01 18:23:09 UTC
Same here, i am unable to start Dota since mesa 17.3.1. The game crash just after the Dota2 launch screen.

GLRenderThread[887]: segfault at 44 ip 00007ff219676ce0 sp 00007ff2174f3910 error 4 in i965_dri.so[7ff219261000+7f8000]

Reverting mesa to 17.2.6 "fix" the issue.

I am also on Archlinux and i have an i7-7500U.

Built mesa with debug symbols and attached gdb with the following command : GAME_DEBUGGER=gdb steam

Here's the result :

Thread 4 "GLRenderThread" received signal SIGSEGV, Segmentation fault.
intel_miptree_choose_aux_usage (brw=brw@entry=0x557d90f43030, mt=0x0) at intel_mipmap_tree.c:366
366	   if (intel_miptree_supports_mcs(brw, mt)) {
(gdb) bt
#0  intel_miptree_choose_aux_usage (brw=brw@entry=0x557d90f43030, mt=0x0) at intel_mipmap_tree.c:366
#1  0x00007faee101f98e in miptree_create (flags=MIPTREE_CREATE_BUSY, num_samples=1, depth0=1, height0=21885, width0=2463255520, last_level=0, first_level=0, format=<optimized out>, target=3553, brw=0x557d90f43030) at intel_mipmap_tree.c:725
#2  intel_miptree_create (brw=brw@entry=0x557d90f43030, target=<optimized out>, format=<optimized out>, first_level=first_level@entry=0, last_level=last_level@entry=0, width0=width0@entry=2463255520, height0=21885, depth0=1, num_samples=1, flags=MIPTREE_CREATE_BUSY)
    at intel_mipmap_tree.c:781
#3  0x00007faee101fb6e in intel_miptree_create_for_renderbuffer (brw=brw@entry=0x557d90f43030, format=<optimized out>, width=width@entry=2463255520, height=height@entry=21885, num_samples=<optimized out>) at intel_mipmap_tree.c:1172
#4  0x00007faee101b018 in intel_alloc_private_renderbuffer_storage (ctx=0x557d90f43030, rb=0x557d8f7c1c00, internalFormat=6402, width=2463255520, height=21885) at intel_fbo.c:305
#5  0x00007faee0d2bf64 in _mesa_resize_framebuffer (ctx=ctx@entry=0x557d90f43030, fb=0x557d8f895900, width=2463255520, height=21885) at main/framebuffer.c:298
#6  0x00007faee0fbb1a4 in driUpdateFramebufferSize (ctx=ctx@entry=0x557d90f43030, dPriv=dPriv@entry=0x557d90f11800) at dri_util.c:842
#7  0x00007faee0ff873c in intel_update_renderbuffers (context=context@entry=0x557d90f2e450, drawable=drawable@entry=0x557d90f11800) at brw_context.c:1362
#8  0x00007faee0ff8e41 in intel_prepare_render (brw=brw@entry=0x557d90f43030) at brw_context.c:1379
#9  0x00007faee0ff470e in brw_clear (ctx=0x557d90f43030, mask=256) at brw_clear.c:278
#10 0x00007faee0c910a5 in clear_bufferfv (no_error=false, value=0x557d915d28b4, drawbuffer=0, buffer=6144, ctx=0x557d90f43030) at main/clear.c:599
#11 _mesa_ClearBufferfv (buffer=6144, drawbuffer=0, value=0x557d915d28b4) at main/clear.c:634
#12 0x00007faedf641644 in ?? () from /home/cduez/.local/share/Steam/steamapps/common/dota 2 beta/game/bin/linuxsteamrt64/librendersystemgl.so
#13 0x00007faedf62285f in ?? () from /home/cduez/.local/share/Steam/steamapps/common/dota 2 beta/game/bin/linuxsteamrt64/librendersystemgl.so
#14 0x00007faedf5eda3f in ?? () from /home/cduez/.local/share/Steam/steamapps/common/dota 2 beta/game/bin/linuxsteamrt64/librendersystemgl.so
#15 0x00007faedf5ee23c in ?? () from /home/cduez/.local/share/Steam/steamapps/common/dota 2 beta/game/bin/linuxsteamrt64/librendersystemgl.so
#16 0x00007faedf60e5b1 in ?? () from /home/cduez/.local/share/Steam/steamapps/common/dota 2 beta/game/bin/linuxsteamrt64/librendersystemgl.so
#17 0x00007faee7efbe26 in ?? () from /home/cduez/.local/share/Steam/steamapps/common/dota 2 beta/game/bin/linuxsteamrt64/libtier0.so
#18 0x00007faeeb82408a in start_thread () from /usr/lib/libpthread.so.0
#19 0x00007faeebb3142f in clone () from /usr/lib/libc.so.6

I am not a gdb expert so if you need further informations, let me know.
Comment 8 Cyril 2018-01-01 18:26:25 UTC
Created attachment 136472 [details]
GDB output
Comment 9 Cyril 2018-01-01 18:29:03 UTC
I have put the gdb output in attachment for more clarity, sorry for the spam.

Also we might want to edit the subject because since mesa 17.3.1 it's not only a problem about the overlay but the game doesn't start at all.
Comment 10 Sven 2018-01-02 01:44:01 UTC
Thank you Cyril for providing the requested gdb trace.
Comment 11 Cyril 2018-01-02 14:35:23 UTC
Used git bisect (between tag mesa-17.3.1 and mesa-17.2.6) and tested if i could launch the game or not for each iteration. Got this commit at the end :


15e208c4ccdd94582a459d0066b587f91caf270c is the first bad commit
commit 15e208c4ccdd94582a459d0066b587f91caf270c
Author: Thomas Hellstrom <thellstrom@vmware.com>
Date:   Thu Sep 14 13:09:05 2017 +0200

    loader/dri3: Don't accidently free buffer holding new back content
    
    Avoid freeing buffers holding new back content
    (with GLX_SWAP_COPY_OML and GLX_SWAP_EXCHANGE_OML)
    Prevously that would have resulted in back buffer content becoming
    incorrect after a swap, although I haven't managed to trigger such a
    situation yet.
    
    Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com>
    Reviewed-by: Sinclair Yeh <syeh@vmware.com>



I was able to launch dota by reverting this one but i guess it's not the proper fix :D
Comment 12 Ian Romanick 2018-01-02 23:23:09 UTC
Created attachment 136512 [details] [review]
Fail gracefully when make_surface returns NULL

Does this patch help?  If there is any difference in behavior with this patch, can you describe it?
Comment 13 Cyril 2018-01-03 09:22:59 UTC
The game start with the patch on top of mesa-17.3.1.

Also tried the Steam overlay (Shift+tab) and switching desktop while in game, had no crash so it seems good.
Comment 14 Ian Romanick 2018-01-03 15:22:31 UTC
(In reply to Cyril from comment #13)
> The game start with the patch on top of mesa-17.3.1.
> 
> Also tried the Steam overlay (Shift+tab) and switching desktop while in
> game, had no crash so it seems good.

And does everything look correct too?
Comment 15 Cyril 2018-01-03 16:26:43 UTC
(In reply to Ian Romanick from comment #14)
> (In reply to Cyril from comment #13)
> > The game start with the patch on top of mesa-17.3.1.
> > 
> > Also tried the Steam overlay (Shift+tab) and switching desktop while in
> > game, had no crash so it seems good.
> 
> And does everything look correct too?

I don't see any artifacts or glitches, it does look good to me. In the other hand, i only tested for 3/4 min as i don't have enough time for an entire game right now.
Comment 16 Ian Romanick 2018-01-03 20:55:24 UTC
(In reply to Cyril from comment #15)
> (In reply to Ian Romanick from comment #14)
> > (In reply to Cyril from comment #13)
> > > The game start with the patch on top of mesa-17.3.1.
> > > 
> > > Also tried the Steam overlay (Shift+tab) and switching desktop while in
> > > game, had no crash so it seems good.
> > 
> > And does everything look correct too?
> 
> I don't see any artifacts or glitches, it does look good to me. In the other
> hand, i only tested for 3/4 min as i don't have enough time for an entire
> game right now.

Excellent.  I've sent a patch to the list with your Tested-by added.

https://patchwork.freedesktop.org/patch/195626/
Comment 17 Sven 2018-01-03 22:36:09 UTC
The bug I originally reported (dota crashed when switching between game and desktop) is not yet fixed. I managed to rebuild mesa with the patch applied. I will try to enable debug symbols next and then I'll try to provide the gdb trace like Cyril did.
Comment 18 Evangelos Foutras 2018-01-03 22:45:15 UTC
Could this be related to bug 104342 (and the most likely duplicate bugs I've added to its "See Also" bugs)? The backtrace here looks a bit different but the brw_clear() call is common (and a few other calls further up).
Comment 19 Sven 2018-01-03 23:00:13 UTC
It sound like bug 104392 .
Comment 20 Cyril 2018-01-04 09:38:14 UTC
Sven, to enable debug symbol you have to add 

options=(debug !strip)

in the PKGBUILB.


And then you have: 

GAME_DEBUGGER=gdb steam

to attach gdb to the game.


I wasn't able to reproduce the crash with the alt-tab, even this morning. I am using i3 as window manager.
Comment 21 Sven 2018-01-05 22:14:42 UTC
Created attachment 136577 [details]
gdb output from crash after alt+tab switching

And here's the output after dota crashed when I switch application with alt+tab.
Comment 22 Evangelos Foutras 2018-01-07 08:09:53 UTC
(In reply to Cyril from comment #11)
> Used git bisect (between tag mesa-17.3.1 and mesa-17.2.6) and tested if i
> could launch the game or not for each iteration. Got this commit at the end :
> 
> 15e208c4ccdd94582a459d0066b587f91caf270c is the first bad commit

I reached the same commit as the first commit that triggers segfaults with mpv (see bug 104376 and its "see also" bugs).

The patch from comment 12 does *not* fix my mpv issue (testing on top of master), so perhaps the actual issue is with commit 15e208c4cc? [1]

[1] https://cgit.freedesktop.org/mesa/mesa/commit/?id=15e208c4cc
Comment 23 Mark Janes 2018-01-08 21:58:24 UTC
Adding Thomas to the CC, since it bisects to his commit.
Comment 24 Sven 2018-01-09 02:32:43 UTC
I'm sorry but multiple issue seem to be discussed here:
- dota crashing on startup (seems to be fixed)
- dota crashing on alt+tab (not yet fixed)
- mpv crashing (not yet fixed)


The alt+tab crash I'm seeing seems to stem from a bug or problem in intel_miptree_create_for_dri_image. I posted the stack trace above.

(In reply to Sven from comment #19)
> It sound like bug 104392 .

I must revert that comment. The backtrace posted there is significantly different.

(In reply to Evangelos Foutras from comment #18)
> Could this be related to bug 104342 (and the most likely duplicate bugs I've
> added to its "See Also" bugs)? The backtrace here looks a bit different but
> the brw_clear() call is common (and a few other calls further up).

I'm not sure whether this is a duplicate of bug 104342. But yes, my backtrace also contains brw_clear(). Is there some theory on what's going on or a patch I could test?
Comment 25 Thomas Hellström 2018-01-10 09:45:18 UTC
Created attachment 136645 [details] [review]
Patch to fix a potential problem with renderbuffer freeing

There is a potential problem with the bisected commit.
I've attached a patch that should remedy that problem if that really was the cause of the problem.

Having said that, it appears like the backtraces are taken with mesa_glthread on.

DRI3 is not really thread-safe, so basically any change however small can break that mode of operation, so please also retest with

export mesa_glthread=0

then relaunch the app.
Comment 26 Evangelos Foutras 2018-01-10 09:48:27 UTC
Any Mesa devs trying to repro this, please test with display compositing disabled.

I posted some possibly useful printf output in [1]. I'm not sure if dri3_handle_present_event() is wrong to free buffers that don't match "buf->pixmap == ie->pixmap" but the commit message [2] doesn't seem to sufficiently explain this.

[1] https://bugs.freedesktop.org/show_bug.cgi?id=104376#c11

[2] https://cgit.freedesktop.org/mesa/mesa/commit/?id=15e208c4cc
Comment 27 Evangelos Foutras 2018-01-10 10:18:52 UTC
(In reply to Thomas Hellström from comment #25)
> Created attachment 136645 [details] [review] [review]
> Patch to fix a potential problem with renderbuffer freeing

Doesn't fix it for me (testing with mpv; probably same for dota).

With the printf()s from bug 104376, and printing !!buf->busy just before the dri3_free_render_buffer() call in dri3_handle_present_event(), I get this just before it crashes:

==================
buf->busy = 0
dri3_handle_present_event() freed 0x7fffbc8d6380
dri3_get_buffer() freeing buffer = 0x7fffbc8d6380; draw->buffers[buf_id] = (nil)
==================

I'm totally unfamiliar with gpu driver code but I'm wondering why the "buf->pixmap == ie->pixmap" check isn't used anymore; why didn't commit 15e208c4cc just stick "draw->cur_blit_source != b" into the second if conditional and leave the rest as it was?
Comment 28 Thomas Hellström 2018-01-10 10:26:00 UTC
(In reply to Evangelos Foutras from comment #27)

> I'm totally unfamiliar with gpu driver code but I'm wondering why the
> "buf->pixmap == ie->pixmap" check isn't used anymore; why didn't commit
> 15e208c4cc just stick "draw->cur_blit_source != b" into the second if
> conditional and leave the rest as it was?

Because then we would potentially leak buffers since we'd have nothing that freed that buffer if freeing had been skipped because draw->cur_blit_source != b.

Anyway, I think I know what might be going on now. Updated patch soon.

/Thomas
Comment 29 Thomas Hellström 2018-01-10 10:33:10 UTC
Created attachment 136647 [details] [review]
Updated patch to avoid freeing renderbuffers currently in use

So what might be happening is that dri3_get_buffer() decides to reuse a render buffer, but then we switch from page-flipping to non-page flipping and receive an event that lowers the back buffer count and thus also frees that render buffer from under dri3_get_buffer().

So only allow updating the back buffer count in dri3_find_back().
Comment 30 Evangelos Foutras 2018-01-10 11:49:52 UTC
(In reply to Thomas Hellström from comment #29)
> Created attachment 136647 [details] [review] [review]
> Updated patch to avoid freeing renderbuffers currently in use

Seems to fix my mpv crashes, thanks! (Perhaps Cyril or Sven might be able to confirm the same about dota.)

FWIW the "buf->busy == 0" condition seems to always evaluate to true in my tests; if I omit it and print !!buf->busy before the dri3_free_render_buffer() call, it always outputs "0".
Comment 31 Thomas Hellström 2018-01-10 12:33:37 UTC
(In reply to Evangelos Foutras from comment #30)

> FWIW the "buf->busy == 0" condition seems to always evaluate to true in my
> tests; if I omit it and print !!buf->busy before the
> dri3_free_render_buffer() call, it always outputs "0".

Yes, I think with the current hardcoded number of back buffers in the flip chain, it should probably always evaluate to true. However if someone in the future were to change that number, it might in some situations evaluate to false, so I'd rather keep that test around to make the code less fragile.

/Thomas
Comment 32 Thomas Hellström 2018-01-10 17:44:45 UTC
*** Bug 104376 has been marked as a duplicate of this bug. ***
Comment 33 Thomas Hellström 2018-01-10 17:46:56 UTC
Taking bug as it appears to be my fault.

Ian, please assign back if you feel differently.
Comment 34 Evangelos Foutras 2018-01-11 01:01:18 UTC
I built mesa/lib32-mesa packages for Arch Linux which include the patch from comment 29 (not sure if lib32-mesa is needed but it can't hurt either):

https://pkgbuild.com/~foutrelis/test-builds/mesa/bug-56944-patch-136647/

@Sven, @Cyril: Please check if dota still crashes (either at startup or alt-tab).

(The packages are signed with my packager key so you can 'pacman -U <URL>' them.)
Comment 35 Nicolas Frattaroli 2018-01-11 03:51:57 UTC
That patch fixes https://bugs.freedesktop.org/show_bug.cgi?id=104301 too, so I'm guessing this was the same issue.
Comment 36 Evangelos Foutras 2018-01-11 05:35:35 UTC
*** Bug 104342 has been marked as a duplicate of this bug. ***
Comment 37 Evangelos Foutras 2018-01-11 05:39:08 UTC
*** Bug 104301 has been marked as a duplicate of this bug. ***
Comment 38 Evangelos Foutras 2018-01-11 06:03:10 UTC
*** Bug 104443 has been marked as a duplicate of this bug. ***
Comment 39 Florian Bruhin 2018-01-11 08:07:10 UTC
FWIW the crashes I've seen in mpv and on YouTube when resizing the window also seem to be gone with the package from comment 34.
Comment 40 Andriy Khulap 2018-01-11 09:11:51 UTC
I was able to reproduce the Bug 104301 on the following system:
Intel(R) HD Graphics 530 (Skylake GT2)  (0x191b)
Ubuntu 16.04 LTS (Kernel 4.4.0)
Mesa 17.3.0 and latest from git.
mpv 0.14.0-git-e1993d5

Debugging showed the following sequence:
1. dri3_get_buffer() is called with draw->num_back=3 and so dri3_find_back() returns buf_id=2.
2. then dri3_fence_await() is called and starts to process events:
XCB_PRESENT_COMPLETE_NOTIFY which dri3_update_num_back() to num_back=2;
XCB_PRESENT_EVENT_IDLE_NOTIFY which free currently used buffer with id=2.
3. buffer becomes corrupted (e.g. buffer->image->bo=0x55), dri3_get_buffer() continues to use that buffer and then tries to dri3_free_render_buffer().
buffer->image->bo is not NULL so passes the check in brw_bo_unreference() and causes segmentation fault.

Patch from Comment 29 solved that issue for me, I can't reproduce it any more.
Comment 41 Andriy Khulap 2018-01-11 09:17:34 UTC
BTW my quick fix was removing the event processing from dri3_fence_await().
(added by a727c804a2c17db306c68e259ae845aa6382d3b1 loader/dri3: Process event after each fence wait)
Comment 42 Cyril 2018-01-11 09:21:12 UTC
(In reply to Evangelos Foutras from comment #34)
> I built mesa/lib32-mesa packages for Arch Linux which include the patch from
> comment 29 (not sure if lib32-mesa is needed but it can't hurt either):
> 
> https://pkgbuild.com/~foutrelis/test-builds/mesa/bug-56944-patch-136647/
> 
> @Sven, @Cyril: Please check if dota still crashes (either at startup or
> alt-tab).
> 
> (The packages are signed with my packager key so you can 'pacman -U <URL>'
> them.)

Tested and approved. I can launch dota. I wasn't affected by the alt-tab bug so i can't say if it better or not.
Comment 43 Evangelos Foutras 2018-01-11 09:39:39 UTC
*** Bug 104392 has been marked as a duplicate of this bug. ***
Comment 44 Thomas Hellström 2018-01-11 09:56:38 UTC
I sent out the patch for review on the mesa-dev list now.

If anybody wants a Reported-by: or Tested-by: tag, please reply to that email,

Thanks,
Thomas
Comment 45 Mark Janes 2018-01-11 22:41:03 UTC
Thomas:  do you have any ideas on how we could catch this category of bug in automated testing?  We have comprehensive automated tests for GL/Vulkan apis, but not much for dri3.
Comment 46 Sven 2018-01-12 00:42:31 UTC
(In reply to Evangelos Foutras from comment #30)
> (In reply to Thomas Hellström from comment #29)
> > Created attachment 136647 [details] [review] [review] [review]
> > Updated patch to avoid freeing renderbuffers currently in use
> 
> Seems to fix my mpv crashes, thanks! (Perhaps Cyril or Sven might be able to
> confirm the same about dota.)
> 
> FWIW the "buf->busy == 0" condition seems to always evaluate to true in my
> tests; if I omit it and print !!buf->busy before the
> dri3_free_render_buffer() call, it always outputs "0".

I applied the patch from comment #29 (and that patch only) and recompiled mesa 17.3.1. So far, dota hasn't crashed on alt+tab.
Comment 47 Thomas Hellström 2018-01-12 06:38:26 UTC
(In reply to Mark Janes from comment #45)
> Thomas:  do you have any ideas on how we could catch this category of bug in
> automated testing?  We have comprehensive automated tests for GL/Vulkan
> apis, but not much for dri3.

When enabling dri3 in our Xorg driver we caught a number of viewport bugs in mesa core dri3 using glretrace with various game traces. We currently do not support page-flipping in our xorg driver, which might be why this wasn't caught, but that would otherwise be a good candidate for automated testing:

Generate apitraces with frequent window resizing, and automate glretraces with image capture and image comparisons, and in addition find a way to trigger transition to- and from page-flipping.

/Thomas
Comment 48 ubitux 2018-01-12 08:51:23 UTC
*** Bug 104579 has been marked as a duplicate of this bug. ***
Comment 49 Tapani Pälli 2018-01-12 11:17:26 UTC
(In reply to Thomas Hellström from comment #47)
> (In reply to Mark Janes from comment #45)
> > Thomas:  do you have any ideas on how we could catch this category of bug in
> > automated testing?  We have comprehensive automated tests for GL/Vulkan
> > apis, but not much for dri3.
> 
> When enabling dri3 in our Xorg driver we caught a number of viewport bugs in
> mesa core dri3 using glretrace with various game traces. We currently do not
> support page-flipping in our xorg driver, which might be why this wasn't
> caught, but that would otherwise be a good candidate for automated testing:
> 
> Generate apitraces with frequent window resizing, and automate glretraces
> with image capture and image comparisons, and in addition find a way to
> trigger transition to- and from page-flipping.
> 
> /Thomas

I think Martin has/had some plans to add these kind of tests to his ezbench system, FYI Martin.
Comment 50 Martin Peres 2018-01-12 13:53:48 UTC
(In reply to Tapani Pälli from comment #49)
> (In reply to Thomas Hellström from comment #47)
> > (In reply to Mark Janes from comment #45)
> > > Thomas:  do you have any ideas on how we could catch this category of bug in
> > > automated testing?  We have comprehensive automated tests for GL/Vulkan
> > > apis, but not much for dri3.
> > 
> > When enabling dri3 in our Xorg driver we caught a number of viewport bugs in
> > mesa core dri3 using glretrace with various game traces. We currently do not
> > support page-flipping in our xorg driver, which might be why this wasn't
> > caught, but that would otherwise be a good candidate for automated testing:
> > 
> > Generate apitraces with frequent window resizing, and automate glretraces
> > with image capture and image comparisons, and in addition find a way to
> > trigger transition to- and from page-flipping.
> > 
> > /Thomas
> 
> I think Martin has/had some plans to add these kind of tests to his ezbench
> system, FYI Martin.

This is already implemented, however the error reporting in case the image does not get generated is not correct (just to clean up the trainee's code that landed in ezbench to be more resistant against that).

And by the way, Eero also added support for doing the same with vulkan traces (through vkreplay).
Comment 51 Jason Ekstrand 2018-01-13 00:40:21 UTC
Can we write a piglit test or two that reproduces this bug?  It would be very good if had a nice self-contained test that we can run in CI and avoid these types of issues in the future.
Comment 52 robsmith11 2018-01-13 06:11:48 UTC
*** Bug 104583 has been marked as a duplicate of this bug. ***
Comment 53 Thomas Hellström 2018-01-13 09:46:57 UTC
(In reply to Jason Ekstrand from comment #51)
> Can we write a piglit test or two that reproduces this bug?  It would be
> very good if had a nice self-contained test that we can run in CI and avoid
> these types of issues in the future.

As a comment to this, the origin of this bug is in itself an interesting sequence of events that might in one way answer your question:

1) Implemented and enabled dri3 support in the vmware Xorg driver.
2) A lot of glretrace automated tests started failing.
3) Fixed / worked around a number of viewport issues in the mesa dri3 core.
3a) Fixed a number of synchronization issues in the mesa dri3 core.
4) Some glretrace automated tests still failing.
5) Implemented dri3 SWAP_COPY_OML to have glretrace mimic the capturing platform behaviour (which was in some cases WGL)
6) SWAP_COPY_OML didn't work due to a long standing bug in mesa core dri.
7) Fixed that. glretrace started to render transparently with SWAP_COPY_OML fbconfigs.
8) This was caused by a long standing issue in the Xorg GLX layer. Worked around that.
9) Kwin and other applications started having transparent rendering issues.
10) Implemented a proper fix for 8)
11) Item 3) was reported to cause a multi-threaded game to fail to start on some platforms with page-flip only. Turned out mesa core dri3 is not thread-safe. Implemented a game-specific workaround.
12) This bug surfaced (was caused by 5)) and was fixed in a way that will decrease the chances of it ever happening again. (Perhaps an extra comment in the source is needed).

So why am I listing this? The reason is that I agree we need more piglit tests to verify GLX and dri3 functionality, but while this bug probably has had the strongest user impact, that doesn't necessarily mean it's likely to happen again. I think the biggest problem with the mesa dri3 implementation currently is that it was written without multi-threading in mind, and adding thread-safety as a hindsight might prove difficult and error-prone. It's also currently lacking a well-documented and well defined strategy to handle drawable size-changes and viewport changes.

So IMHO we should try to write piglit tests for the areas where we know there are remaining issues.

/Thomas
Comment 54 Jason Ekstrand 2018-01-13 19:27:26 UTC
(In reply to Thomas Hellström from comment #53)
> So IMHO we should try to write piglit tests for the areas where we know
> there are remaining issues.

That's a reasonable thing to say.  However, our DRI testing in piglit is rather pitiful across the board IMHO.  Part of the reason why you had to fix 6 different bugs in order to turn on DRI3 is that we've been lazy about testing when working on the DRI code.  Window system stuff is annoyingly full of edge-cases and the piglit tests we have tend to only touch-test things.  This is only one of the 3 or 4 DRI bugs we've shipped in the last few months that piglit has been perfectly happy with and then it messes up users badly.

While I would love to give someone the general task of improving piglit testing of DRI, a good place to start is to write tests for known bugs.  If we have a really nasty threading bug, let's make a test that hammers on threading.

Ok, I've said my piece.  Maybe I'm being unreasonable but it sounds good in my head. :-)
Comment 55 Juan A. Suarez 2018-01-18 20:02:09 UTC
Is this bug already fixed? With commit:

commit 897c54d522ab960a879b763a15e489f630c491ee
Author: Thomas Hellstrom <thellstrom@vmware.com>
Date:   Thu Jan 11 10:19:23 2018 +0100

    loader/dri3: Avoid freeing renderbuffers in use
Comment 56 Sven 2018-01-19 11:00:48 UTC
(In reply to Juan A. Suarez from comment #55)
> Is this bug already fixed? With commit:
> 
> commit 897c54d522ab960a879b763a15e489f630c491ee
> Author: Thomas Hellstrom <thellstrom@vmware.com>
> Date:   Thu Jan 11 10:19:23 2018 +0100
> 
>     loader/dri3: Avoid freeing renderbuffers in use

The patch from comment #29 (which is identical to the commit, right?) has been included in recent versions of Mesa on Arch Linux. So far, I have had zero crashes. So yes, I believe this is fixed.
Comment 57 Thomas Hellström 2018-01-19 12:39:11 UTC
Should be fixed now.
Comment 58 tazer 2018-01-20 20:20:43 UTC
I can confirm this has been fixed for me, I was experiencing the browser crashes explained in 104376.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.