On the latest stable Arch Linux with an AMD HD 7750, Mesa 17.3 breaks Firefox, LibreOffice, mpv with the X11 backend, and Chromium (when running WebGL). Launching firefox or glxinfo from the command line hangs with no output and no crash. There are no errors in dmesg or journalctl except a "Probe failed" message from amdkfd, but blacklisting amdkfd doesn't solve the issue. Running strace on firefox shows interrupted system call notices. Chromium starts, but opening any WebGL page freezes it. mpv with the X11 backend reports Bad Drawable, server failed to allocate resource. Downgrading to mesa 17.2.6 solves the issue.
Similar experience here on Arch Linux with a Radeon HD 8870M / R9 M270X/M370X. Firefox hangs on launch before drawing a window, and Chromium won't start either unless I pass --disable-gpu. Similar strace results, no amdkfd errors. Archive package 2017/12/15/extra/os/x86_64/mesa-17.2.6-1 works fine, 2017/12/16/extra/os/x86_64/mesa-17.3.0-2 doesn't. Kernel is 4.9.69-1-lts #1 SMP Thu Dec 14 19:51:07 CET 2017 x86_64 GNU/Linux, if it matters. I notice the Arch maintainer changed the build arg --enable-omx to --enable-omx-bellagio in this release. I know nothing of Mesa's build process, this being my first encounter, but I thought that might be useful to mention in case it's relevant.
Working on narrowing down the source of the issue, I've found so far that 17.2.7 builds work fine (though I had to change --enable-omx-bellagio back to --enable-omx), and 17.3.0-rc3 does not. I'll try an -rc1 build, then I may be on to learning how to package from a locally bisected git.
A cycle of git bisect building leads me here:

255573996cc997cb61be9adad3e8fcaa78db5d1f is the first bad commit
commit 255573996cc997cb61be9adad3e8fcaa78db5d1f
Author: Marek Olšák <marek.olsak@amd.com>
Date:   Mon Oct 9 18:56:22 2017 +0200

    st/dri: implement __DRIimageExtension::validateUsage properly

    Reviewed-by: Daniel Stone <daniels@collabora.com>
    Reviewed-by: Nicolai Hähnle <nicolai.haehnle@amd.com>

:040000 040000 691ace5458e444a262bd2632b0234f7c08711674 181aa1f345cafc40db98187dd9e66b115c8f7e52 M src

I'm not familiar enough with C or this codebase to render an opinion about why this commit would have broken things, but I'd love to assist if someone with more perspective needs me to run/test things to determine a resolution here.
Created attachment 136333
Bisect log

Bisect log attached. I also had to revert c4ed39f85b (which I did by cherry-picking b6f41e393e) to build some of these versions.
I can't reproduce this with weston or gnome-shell. Which Wayland compositor and which version of Xwayland are you using? Can you get a gdb backtrace of glxinfo when it hangs?
Created attachment 136345
Hung glxinfo backtrace

gnome-shell 3.26.2+9+ga3736d3a3-1
xorg-server-xwayland 1.19.5-1

Backtrace attached. I've also got strace output if it would be useful. Thanks for your help!
Created attachment 136350
Hung glxinfo backtrace - mesa 17.3.1-1

On the chance you'd want to see a newer version, I built mesa 17.3.1-1 (with debug symbols this time!) and have the same issue. Backtrace attached.
*** Bug 104351 has been marked as a duplicate of this bug. ***
I also have this problem on Arch Linux with a Radeon HD 7770, mesa 17.3.1-2, gnome-shell 3.26.2+9+ga3736d3a3-1. The GNOME Wayland session starts, but several Xwayland apps (Firefox, Thunderbird, Steam client) do not. glxinfo hangs with no output when run, as reported above. Downgrading to mesa 17.2.6 appears to fix it, but for this card I can also work around it by switching to the amdgpu kernel module instead of the default radeon, which appears to resolve the issues under mesa 17.3.
Following up on this, it looks like my bisect was bad. I retried commit 255573996cc997cb61be9adad3e8fcaa78db5d1f and it works fine. I will re-bisect to find the real offending commit and report when I have an answer.
I can confirm that switching from radeon to amdgpu on

03:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Pitcairn XT [Radeon HD 7870 GHz Edition]
Device: AMD Radeon HD 7800 Series (PITCAIRN / DRM 3.19.0 / 4.14.11-1-ARCH, LLVM 5.0.1) (0x6818)

fixes the issue.
Subsequent bisects (starting at different ranges so different commits would be tested) have led me to the same place. No idea why I thought 255573996 worked for a while; all new builds from it break for me. I have Pacman packages for every build I've made if anyone would find them useful.

I narrowed the problem to the final return statement by replacing it with "return true;" which resulted in a working build. I also tried to rule out side effects from the call by adding "bind = screen->check_resource_capability(screen, image->texture, bind) ? 1 : 0;" before the return true, so the method would be called but its return value ignored, assuming the compiler is not smart enough to optimize that out. That build worked too. I assume that capability check must sometimes be returning false in some unexpected situations, and something is not handling that.

I tried making a debug build of mesa-demos, then ran gdb on glxinfo and set a breakpoint on dri2_validate_usage, but that method appears not to be called before the process hangs. I assume something during system startup calls it, caches the result, and leads to these issues downstream? If anyone has any guidance on next steps, I'd appreciate it.
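For anyone following along, the experiment above can be modelled in a few lines of C. This is only a toy sketch, not Mesa code: toy_texture, toy_check_capability, toy_validate_original, toy_validate_experiment, and the check_calls counter are all hypothetical names invented for illustration. It shows the difference between returning the capability check's result and calling the check while discarding its result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the texture being validated. */
struct toy_texture {
    bool is_displayable;
};

/* Counts how often the capability check runs, so the call is observable. */
static int check_calls;

/* Hypothetical capability check: reports whether the texture is displayable. */
static bool toy_check_capability(const struct toy_texture *tex)
{
    assert(tex != NULL);
    check_calls++;
    return tex->is_displayable;
}

/* Models the behaviour that breaks: the check's result is returned as-is. */
static bool toy_validate_original(const struct toy_texture *tex)
{
    return toy_check_capability(tex);
}

/* Models the experiment: the check is still called, but its result is
 * stored in a throwaway variable and the function always returns true. */
static bool toy_validate_experiment(const struct toy_texture *tex)
{
    int bind = toy_check_capability(tex) ? 1 : 0;
    (void)bind; /* result deliberately ignored */
    return true;
}
```

With a texture whose is_displayable is false, toy_validate_original returns false while toy_validate_experiment returns true, even though both invoke the check once. That matches the observation that the call itself is harmless and only its return value matters.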
(In reply to eric vz from comment #12)
> [...] gdb glxinfo and set a breakpoint on dri2_validate_usage, but that method
> appears not to be called before the process hangs.

It's not called in the client but in the Wayland compositor process, or maybe in Xwayland.
Thanks, Michel. On a hanging call to glxinfo, dri2_validate_usage is being called from gbm_dri_bo_import, which gets parameters of a 100x100 FD image buffer for usage 5 (scanout | rendering). However, the image texture winds up with is_displayable = 0, so si_check_resource_capabilities returns false, gbm_dri_bo_import destroys the image, and glxinfo hangs. I confirmed that if I set variable usage = 0 the problem does not occur.

I have not yet been able to trace where the texture flags are set, but I got far enough to find that they're being read from the image buffer as metadata. However, it occurs to me that as a newcomer to the domain (e.g., I had to google what a scanout is), I don't know the real problem:

* Should usage include scanout?
* Should the image metadata include is_displayable?
* Should the null return from gbm_dri_bo_import cause the caller to hang?

Either point 1 or point 2 seems like it must be wrong, and might be the easiest fix. Point 1 seems the stronger case, since glxinfo doesn't actually draw anything as far as I'm aware. Since Firefox, Chromium, and other apps also seem to fail on this, I'm expecting to find some kind of common initialization code that does it.

Will report back when I get time to dig. Or, if anyone knows better than I do which avenue to pursue, I'm happy to be redirected. I don't expect to have time to look at this until the weekend.
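The failure chain described above can be sketched as a small self-contained C model. Everything here is a hypothetical stand-in rather than Mesa or GBM code: TOY_USE_SCANOUT and TOY_USE_RENDERING are illustrative flags whose values are chosen only so that 5 equals scanout | rendering, as reported above, and toy_image, toy_validate_usage, and toy_import are invented names. The assumption being modelled is that a scanout request fails validation when the image is not displayable, and that the import then returns NULL.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical usage flags; values chosen so that 5 == scanout | rendering. */
enum {
    TOY_USE_SCANOUT   = 1u << 0,
    TOY_USE_RENDERING = 1u << 2
};

/* Hypothetical image whose displayable flag comes from buffer metadata. */
struct toy_image {
    bool is_displayable;
};

/* Models the usage validation: only a scanout request depends on the
 * displayable flag; other usages pass unconditionally in this sketch. */
static bool toy_validate_usage(const struct toy_image *img, unsigned usage)
{
    assert(img != NULL);
    if (usage & TOY_USE_SCANOUT)
        return img->is_displayable;
    return true;
}

/* Models the import path: a failed validation yields NULL, which the
 * caller has to handle. */
static const struct toy_image *toy_import(const struct toy_image *img,
                                          unsigned usage)
{
    if (!toy_validate_usage(img, usage))
        return NULL;
    return img;
}
```

In this model, importing a non-displayable image with usage 5 returns NULL, while usage 0 succeeds, which mirrors the usage = 0 experiment. Either clearing the scanout bit or having the metadata mark the image as displayable makes the import succeed, corresponding to points 1 and 2 in the list.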
Thanks for the help, Eric. https://patchwork.freedesktop.org/patch/200999/ fixes this for me.
Thanks a lot, Michel! I confirmed that 255573996 plus the linked patch works for me as well.
Thanks for the report, fixed in Git master (and should get backported to a future 17.3.y release):

Commit: 1cf1bf32eff5ffca0b928c0884b0e792207b61b7
URL: http://cgit.freedesktop.org/mesa/mesa/commit/?id=1cf1bf32eff5ffca0b928c0884b0e792207b61b7
Author: Michel Dänzer <michel.daenzer@amd.com>
Date: Fri Jan 26 18:32:32 2018 +0100

    winsys/radeon: Compute is_displayable in surf_drm_to_winsys