Bug 104306

Summary: Mesa 17.3 breaks Firefox and other Xwayland apps on AMD HD7750
Product: Mesa
Component: Drivers/Gallium/radeonsi
Version: 17.3
Hardware: x86-64 (AMD64)
OS: Linux (All)
Status: RESOLVED FIXED
Severity: normal
Priority: medium
Reporter: Mesa Bug <mesabug11>
Assignee: Default DRI bug account <dri-devel>
QA Contact: Default DRI bug account <dri-devel>
CC: breno.ec, evonzee+freedesktop, jan.public
Attachments: Bisect log
             Hung glxinfo backtrace
             Hung glxinfo backtrace - mesa 17.3.1-1

Description Mesa Bug 2017-12-18 00:53:49 UTC
On the latest stable Arch Linux with an AMD HD7750, Mesa 17.3 breaks Firefox, LibreOffice, mpv with the X11 backend, and Chromium (when running WebGL).

Launching firefox or glxinfo from the command line hangs: there is no response and no crash. There are no errors in dmesg or journalctl except a "Probe failed" message from amdkfd, and blacklisting amdkfd doesn't solve the issue. Running strace on firefox shows interrupted system call notices. Chromium starts, but opening any WebGL page freezes it. mpv with the X11 backend reports "Bad Drawable, server failed to allocate resource".

Downgrading to mesa 17.2.6 solves the issue.
Comment 1 eric vz 2017-12-19 00:21:47 UTC
Similar experience here on Arch Linux with a Radeon HD 8870M / R9 M270X/M370X.  Firefox hangs on launch before drawing a window, and Chromium won't start either unless I --disable-gpu.  Similar strace results, no amdkfd errors.

Archive package 2017/12/15/extra/os/x86_64/mesa-17.2.6-1 works fine, 2017/12/16/extra/os/x86_64/mesa-17.3.0-2 doesn't.  Kernel 4.9.69-1-lts #1 SMP Thu Dec 14 19:51:07 CET 2017 x86_64 GNU/Linux, if it matters.

I notice the Arch maintainer changed the build arg --enable-omx to --enable-omx-bellagio in this release.  I know nothing of Mesa's build process, this being my first encounter, but I thought that might be useful to mention in case it's relevant.
Comment 2 eric vz 2017-12-19 22:30:13 UTC
Working on narrowing down the source of the issue, I've found so far that 17.2.7 builds work fine (though I had to change --enable-omx-bellagio back to --enable-omx), and 17.3.0-rc3 does not.  I'll try an -rc1 build, then I may be on to learning how to package from a locally bisected git.
Comment 3 eric vz 2017-12-20 19:12:34 UTC
A cycle of git bisect building leads me here:

255573996cc997cb61be9adad3e8fcaa78db5d1f is the first bad commit
commit 255573996cc997cb61be9adad3e8fcaa78db5d1f
Author: Marek Olšák <marek.olsak@amd.com>
Date:   Mon Oct 9 18:56:22 2017 +0200

    st/dri: implement __DRIimageExtension::validateUsage properly
    
    Reviewed-by: Daniel Stone <daniels@collabora.com>
    Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>

:040000 040000 691ace5458e444a262bd2632b0234f7c08711674 181aa1f345cafc40db98187dd9e66b115c8f7e52 M	src

I'm not familiar enough with C or this codebase to render an opinion about why this commit would have broken things, but I'd love to assist if someone with more perspective needs me to run/test things to determine a resolution here.
Comment 4 eric vz 2017-12-20 19:26:28 UTC
Created attachment 136333
Bisect log

Bisect log attached.  I also had to revert c4ed39f85b (which I did by cherry-picking b6f41e393e) to build some of these versions.
Comment 5 Michel Dänzer 2017-12-21 16:45:49 UTC
I can't reproduce this with weston or gnome-shell.

Which Wayland compositor and which version of Xwayland are you using?

Can you get a gdb backtrace of glxinfo when it hangs?
Comment 6 eric vz 2017-12-21 17:18:09 UTC
Created attachment 136345
Hung glxinfo backtrace

gnome-shell 3.26.2+9+ga3736d3a3-1
xorg-server-xwayland 1.19.5-1

Backtrace attached.  I've also got strace output if it would be useful.  Thanks for your help!
Comment 7 eric vz 2017-12-21 20:33:16 UTC
Created attachment 136350
Hung glxinfo backtrace - mesa 17.3.1-1

On the chance you'd want to see a newer version, I built mesa 17.3.1-1 (with debug symbols this time!) and have the same issue.  Backtrace attached.
Comment 8 Emil Velikov 2017-12-28 20:28:30 UTC
*** Bug 104351 has been marked as a duplicate of this bug. ***
Comment 9 hazies 2018-01-02 21:14:38 UTC
I also have this problem on Arch Linux with a Radeon HD 7770, mesa 17.3.1-2, gnome-shell 3.26.2+9+ga3736d3a3-1.  The Gnome Wayland session starts, but several XWayland apps (Firefox, Thunderbird, steam client) do not.  glxinfo shows no response when run, as reported above.

Downgrading to mesa 17.2.6 appears to fix it, but for this card I can also work around it by switching to the amdgpu kernel module instead of the default radeon, which appears to resolve the issues under mesa 17.3.
Comment 10 eric vz 2018-01-03 19:51:09 UTC
Following up on this, it looks like my bisect was bad.  I retried commit 255573996cc997cb61be9adad3e8fcaa78db5d1f and it works fine.

I will re-bisect to find the real offending commit and report when I have an answer.
Comment 11 Sergey Kvachonok 2018-01-05 06:22:58 UTC
I can confirm that switching from radeon to amdgpu on

03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Pitcairn XT [Radeon HD 7870 GHz Edition]

Device: AMD Radeon HD 7800 Series (PITCAIRN / DRM 3.19.0 / 4.14.11-1-ARCH, LLVM 5.0.1) (0x6818)

fixes the issue.
Comment 12 eric vz 2018-01-09 00:50:11 UTC
Subsequent bisects (starting at different ranges so different commits would be tested) have led me to the same place. No idea why I thought 255573996 worked for a while; all new builds from it break for me.  I have Pacman packages for every build I've made if anyone would find them useful.

I narrowed the problem to the final return statement by replacing it with "return true;" which resulted in a working build.  I also tried to rule out side-effects from the call by adding "bind = screen->check_resource_capability(screen, image->texture, bind) ? 1 : 0;" before the return true so the method would be called but its return value ignored, assuming the compiler is not smart enough to optimize that out.  That build worked too. 
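
For readers following along, the three variants described above can be laid out side by side. This is a minimal sketch, not the Mesa source: the toy_* types and the validate_* function names are hypothetical stand-ins, and the always-failing check is an assumption made purely for illustration. Only the shape of the call, screen->check_resource_capability(screen, image->texture, bind), comes from the comment.

```c
#include <stdbool.h>

/* Hypothetical stand-ins; the real Mesa structures carry far more state. */
struct toy_texture { bool is_displayable; };

struct toy_screen {
    bool (*check_resource_capability)(struct toy_screen *screen,
                                      struct toy_texture *texture,
                                      unsigned bind);
    unsigned calls; /* how many times the capability check has run */
};

/* Toy check (assumption): always refuses, and records that it was called. */
static bool toy_check(struct toy_screen *screen, struct toy_texture *texture,
                      unsigned bind)
{
    (void)texture;
    (void)bind;
    screen->calls++;
    return false;
}

/* Variant A: the result of the check is returned to the caller. */
static bool validate_original(struct toy_screen *screen,
                              struct toy_texture *texture, unsigned bind)
{
    return screen->check_resource_capability(screen, texture, bind);
}

/* Variant B: first experiment, the check is skipped entirely. */
static bool validate_return_true(struct toy_screen *screen,
                                 struct toy_texture *texture, unsigned bind)
{
    (void)screen;
    (void)texture;
    (void)bind;
    return true;
}

/* Variant C: second experiment, the check runs but its result is discarded. */
static bool validate_call_and_ignore(struct toy_screen *screen,
                                     struct toy_texture *texture, unsigned bind)
{
    bind = screen->check_resource_capability(screen, texture, bind) ? 1 : 0;
    (void)bind;
    return true;
}
```

Because variants B and C both return true while only C still invokes the check, a working build with C points at the returned value rather than at a side effect of making the call.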

I assume that capability check must sometimes be returning false in some unexpected situations, and something is not handling that.  I tried making a debug build of mesa-demos, then gdb glxinfo and set a breakpoint on dri2_validate_usage, but that method appears not to be called before the process hangs.  I assume something during system startup calls it, caches the result, and leads to these issues downstream?

If anyone has any guidance on next steps, I'd appreciate it.
Comment 13 Michel Dänzer 2018-01-09 13:53:35 UTC
(In reply to eric vz from comment #12)
> [...] gdb glxinfo and set a breakpoint on dri2_validate_usage, but that method
> appears not to be called before the process hangs.

It's not called in the client but in the Wayland compositor process, or maybe in Xwayland.
Comment 14 eric vz 2018-01-10 16:47:17 UTC
Thanks, Michel.  On a hanging call to glxinfo, dri2_validate_usage is being called from gbm_dri_bo_import, which gets parameters of a 100x100 FD image buffer for usage 5 (scanout | rendering).  However, the image texture winds up with is_displayable = 0, so si_check_resource_capabilities returns false, gbm_dri_bo_import destroys the image, and glxinfo hangs.  I confirmed that if I set variable usage = 0 the problem does not occur.  I have not yet been able to trace where the texture flags are set, but I got far enough to find that they're being read from the image buffer as metadata.  
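
The chain of events above can be modelled in a few lines. This is only a sketch under stated assumptions: the TOY_* flags, struct toy_image, toy_validate_usage and toy_import are hypothetical names, not Mesa or GBM API, and the flag bits are chosen so that 5 decodes to scanout | rendering as observed in the debugger. The real check involves more conditions than the single one shown here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical usage bits, chosen so that scanout | rendering == 5. */
enum {
    TOY_USE_SCANOUT   = 1 << 0,
    TOY_USE_RENDERING = 1 << 2
};

/* Hypothetical image: only the flag discussed in the comment is modelled. */
struct toy_image { bool is_displayable; };

/* Toy validation (assumption): scanout usage requires a displayable image. */
static bool toy_validate_usage(const struct toy_image *image, unsigned usage)
{
    if ((usage & TOY_USE_SCANOUT) && !image->is_displayable)
        return false;
    return true;
}

/* Toy import: gives back NULL when validation fails, standing in for the
 * step where the freshly created image is destroyed. */
static const struct toy_image *toy_import(const struct toy_image *image,
                                          unsigned usage)
{
    return toy_validate_usage(image, usage) ? image : NULL;
}
```

In this model an image with is_displayable = 0 is rejected for usage 5 but accepted for usage 0, which matches both observations: the failing import and the workaround of forcing usage to 0.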

However, it occurs to me that as a newcomer to the domain (e.g., I had to google what a scanout is), I don't know the real problem:

* Should usage include scanout?
* Should the image metadata include is_displayable? 
* Should the null return from gbm_dri_bo_import cause the caller to hang?

Either point 1 or point 2 seems like it must be wrong, and might be the easiest fix. Point 1 seems the stronger case, since glxinfo doesn't actually draw anything as far as I'm aware. Since Firefox, Chromium, and other apps also seem to fail on this, I'm expecting to find some kind of common initialization code that does it.  Will report back when I get time to dig.  Or, if anyone knows better than I do which avenue to pursue, I'm happy to be redirected.  I don't expect to have time to look at this until the weekend.
Comment 15 Michel Dänzer 2018-01-26 17:35:01 UTC
Thanks for the help, Eric.

https://patchwork.freedesktop.org/patch/200999/ fixes this for me.
Comment 16 eric vz 2018-01-26 23:24:13 UTC
Thanks a lot, Michel!  I confirmed that 255573996 plus the linked patch works for me as well.
Comment 17 Michel Dänzer 2018-01-31 09:07:46 UTC
Thanks for the report, fixed in Git master (and should get backported to a future 17.3.y release):

Commit: 1cf1bf32eff5ffca0b928c0884b0e792207b61b7
URL:    http://cgit.freedesktop.org/mesa/mesa/commit/?id=1cf1bf32eff5ffca0b928c0884b0e792207b61b7

Author: Michel Dänzer <michel.daenzer@amd.com>
Date:   Fri Jan 26 18:32:32 2018 +0100

winsys/radeon: Compute is_displayable in surf_drm_to_winsys
