Bug 99236

Summary: System (seems to) completely freeze when interacting with java swing applications.
Product: Mesa
Reporter: Vitaly Ostrosablin <tmp6154>
Component: Drivers/Gallium/radeonsi
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact: Default DRI bug account <dri-devel>
Severity: major    
Priority: medium    
Version: unspecified   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Attachments: dmesg output from faulty session
Xorg.0.log from faulty session
Java Swing reproducer application
Comparison of reproducer app under java 8 and java 7 with git mesa

Description Vitaly Ostrosablin 2016-12-31 13:46:37 UTC
I'm not completely certain whether this is related to the AMD GPU driver, but it's a rather strange issue that I can reproduce reliably.

I'm running a Gentoo system with an AMD Radeon RX 480, using the git AMD GPU driver and the git version of Mesa. When interacting with scrollable JTextAreas in a Java Swing application, I hit an issue that reproduces 100% of the time. Since this affects multiple Java Swing applications (e.g. Eclipse), including the program I'm currently developing, I can attempt to put together a reproducer Java app if needed.

For the first 2-3 seconds the mouse cursor moves jerkily, then it stops moving altogether and the display stays frozen on the same image. Nothing appears to work, including Ctrl+Alt+F1, etc. But although the machine looks completely locked up, it actually isn't: I can ssh into it and issue a reboot command. During that time the display shows no signs of life until the machine reboots (but I can hear the KDE shutdown sound).

This occurs only under JRE8, with both the IcedTea and Oracle variants. Under JRE7 the issue doesn't trigger. If memory serves, JRE8 brought improvements to hardware-accelerated GUI rendering. This is especially strange, since the Swing GUI framework renders its own GUI widgets.

Considering that even Ctrl+Alt+F1 doesn't work, I suppose the problem happens at the kernel level, possibly in the AMD GPU driver.
Comment 1 Alex Deucher 2017-01-02 16:13:25 UTC
Please attach your xorg log and dmesg output.
Comment 2 Vitaly Ostrosablin 2017-01-03 08:54:25 UTC
Created attachment 128721 [details]
dmesg output from faulty session

Here's my dmesg. From these lines:

[  675.891897] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00004802
[  675.891899] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  675.891900] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048002
[  675.891902] VM fault (0x02, vmid 1) at page 0, read from 'TC4' (0x54433400) (72)
[  675.892003] amdgpu 0000:01:00.0: GPU fault detected: 147 0x00004802
[  675.892004] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_ADDR   0x00000000
[  675.892006] amdgpu 0000:01:00.0:   VM_CONTEXT1_PROTECTION_FAULT_STATUS 0x02048002

It's obvious that something went terribly wrong inside the AMD GPU.
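A quick way to pull the relevant information out of a log like this is sketched below. It is only an illustration: the function names `count_gpu_faults` and `decode_client_tag` are made up for this example, and the second one rests on the assumption (suggested by the `'TC4' (0x54433400)` pair in the log above) that the hex value is simply the ASCII encoding of the memory client name.

```shell
# Count "GPU fault detected" events in kernel log text read from stdin.
# Usage: dmesg | count_gpu_faults
count_gpu_faults() {
    grep -c 'GPU fault detected'
}

# Decode a 32-bit value such as 0x54433400 into its ASCII bytes,
# most significant byte first, skipping NUL bytes.
# Assumption: the value printed next to the client name is plain ASCII.
decode_client_tag() {
    val=$(( $1 ))
    out=""
    for bits in 24 16 8 0; do
        byte=$(( (val >> bits) & 0xff ))
        [ "$byte" -eq 0 ] && continue
        # Emit the byte via an octal escape, which POSIX printf understands.
        out="$out$(printf "\\$(printf '%03o' "$byte")")"
    done
    printf '%s\n' "$out"
}
```

For the log above, `decode_client_tag 0x54433400` gives back the `TC4` name shown in the VM fault line.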
Comment 3 Vitaly Ostrosablin 2017-01-03 09:00:04 UTC
Created attachment 128722 [details]
Xorg.0.log from faulty session

This is the Xorg log. It doesn't seem that X noticed the fault at all. Moreover, with `ps -e` over ssh I can see the X process and all other GUI programs still running just fine. So it seems that only the GPU driver has failed; other stuff doesn't seem to be affected.
Comment 4 Vitaly Ostrosablin 2017-01-03 09:43:17 UTC
Created attachment 128723 [details]
Java Swing reproducer application

Here are the Java sources of a reproducer mini-application for the issue. Running it and pressing the button results in an AMD GPU fault for me, similar to the ones I've already attached logs for. It works in 100% of cases (2/2) when run under JRE8.
Comment 5 Michel Dänzer 2017-01-07 08:00:29 UTC
It's most likely a Mesa driver issue.

Can you try running Xephyr with something like this:

 GALLIUM_DDEBUG="pipelined 2000" Xephyr :99 -glamor -screen 1024x768

and then run the reproducer app with DISPLAY=:99 . After the hang, a file should appear in ~/ddebug_dumps/. Please attach that file here.
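Put together, the steps above could look roughly like the sketch below. The Xephyr command line, the display number and the dump directory are taken from this comment; the helper names `run_in_xephyr` and `newest_dump`, the two-second wait, and the example client command are assumptions made for illustration only.

```shell
# Print the most recently modified file in the dump directory, if any.
# Defaults to ~/ddebug_dumps, the location mentioned above.
newest_dump() {
    dir=${1:-"$HOME/ddebug_dumps"}
    [ -d "$dir" ] || return 1
    newest=$(ls -t "$dir" | head -n 1)
    [ -n "$newest" ] && printf '%s/%s\n' "$dir" "$newest"
}

# Start a nested Xephyr server with hang detection enabled, run the given
# client on it, then stop the server and report the newest dump file.
# Example (the jar name is hypothetical): run_in_xephyr java -jar reproducer.jar
run_in_xephyr() {
    GALLIUM_DDEBUG="pipelined 2000" Xephyr :99 -glamor -screen 1024x768 &
    xephyr_pid=$!
    sleep 2                 # crude wait for the nested server to come up
    DISPLAY=:99 "$@"        # run the client inside the nested server
    kill "$xephyr_pid" 2>/dev/null
    newest_dump
}
```

If the hang takes the whole display down, `newest_dump` can also be run on its own later over ssh.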
Comment 6 Vitaly Ostrosablin 2017-01-07 11:28:58 UTC
Yes, no problem. However, my Xorg was compiled without Xephyr, so I rebuilt it. Unfortunately, I decided to update Mesa to the latest commit as well, and it seems that one of the recent commits breaks everything (I get an unusable desktop which looks white, and I can only see gray outlines of the KDE taskbar, the mouse cursor and the login password box cursor). So I had to temporarily revert to Mesa 13.0.3. But there is some useful info. First, on 13.0.3 I cannot reproduce the fault with the reproducer app. Second, the reproducer app looks the same under 13.0.3 on both Java 7 and Java 8. I had found it strange that under Java 7 the Swing app looks like it should (Metal look & feel) while under Java 8 it looked different (white buttons instead of the default metallic ones). But it appears that this was just a rendering artefact.

I will now try to get back to a working Mesa commit and reproduce the problem with Xephyr.
Comment 7 Vitaly Ostrosablin 2017-01-07 11:53:55 UTC
Created attachment 128804 [details]
Comparison of reproducer app under java 8 and java 7 with git mesa

Checked out a two-day-old revision of Mesa. Attached a screenshot of what I meant about the reproducer app.

Will try running it in Xephyr.
Comment 8 Vitaly Ostrosablin 2017-01-07 12:16:39 UTC
Attempted to run the reproducer app in Xephyr. It looks exactly as it does on the host with Java 8 in the attached screenshot, i.e. with a white button. However, clicking the button just adds the text to the text area, as programmed, while doing this directly on the host's Xorg hangs the system.

I think it's possible that the app alone is not enough to reproduce the issue; KDE and its window manager might be at play here, too. But it's strange, because this seems to occur when adding text to a JTextArea, which is rendered by Swing and should be least affected by the WM and DE (except for the window border, which is absent in Xephyr, since no WM runs there).

But if that's mesa-only bug, shouldn't Ctrl+Alt+F1 work? Here GPU appears to have stopped output completely (most likely, fullscreen tty opens, but GPU shows same picture as on moment of freeze).
Comment 9 Michel Dänzer 2017-01-09 07:18:39 UTC
Any chance you can bisect Mesa?

(In reply to Vitaly Ostrosablin from comment #8)
> But if that's mesa-only bug, shouldn't Ctrl+Alt+F1 work? Here GPU appears to
> have stopped output completely (most likely, fullscreen tty opens, but GPU
> shows same picture as on moment of freeze).

A GPU hang tends to cause the Xorg process to hang as well, which prevents VT switching from working.
Comment 10 Vitaly Ostrosablin 2017-01-09 19:12:30 UTC
Yes, I will try to bisect Mesa. Unfortunately, it looks like I'll have to do that manually, since Gentoo doesn't seem to have bisect tools for Portage. So far I can offer the following initial info:

1) The bug had not yet been introduced as of November 30, 2016.
2) The white button artefact doesn't seem to be related to the hang. In the Nov 30 commit the button is white, but pressing it doesn't hang the system.
3) By Dec 20, 2016, the hang had already been introduced.
Comment 11 Vitaly Ostrosablin 2017-01-09 19:35:47 UTC
Further narrowed the date range: between Dec 6 and Dec 12.
Comment 12 Vitaly Ostrosablin 2017-01-10 17:22:02 UTC
Looks like it broke on Dec 7. There were a lot of radeonsi-related commits, but I had difficulty building a working Mesa from them. On Dec 6 there was no bug. None of the Dec 7 commits seem to work; they segfault. Later, on Dec 8, Mesa can be compiled and started, but the issue is already present.
Comment 13 Ilia Mirkin 2017-01-10 18:45:03 UTC
Vitaly - commit id's please. Dates are largely meaningless - the default date shown by git has little to do with when the commit made it into a particular tree, even with mesa's rebase policy.
Comment 14 Vitaly Ostrosablin 2017-01-10 19:23:45 UTC
85a3057f651a1c56348f1af18343d9cc0a5c93f3 used to work fine.

After that, within at most 3 commits from this point, something broke and Mesa didn't run (checked on 4c8c13b3568c82e503a10ddcb846b4c96261ec4c).

One of the later commits I tried was 132b69c4edb824c70c98f8937c63e49b04f3adff, which didn't work either.

After it, there was a huge batch of radeonsi commits.

c7dc1b010ae581f532240b661cb3d1c82e117e7e is not runnable either.

bd56de88dfb192310f3432a3c0e0ddc3469c6d55 is runnable (the breakage was probably fixed somewhere earlier) and the Java reproducer app hangs the system there.
Comment 15 Michel Dänzer 2017-01-11 09:43:28 UTC
For any commits that you can't test, run

 git bisect skip

Eventually, git bisect will either show the commit which introduced the problem, or the minimal set of candidates.
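As a rough sketch, a manual bisect session for this bug could be organised as below. The commit IDs are the known-good and known-bad ones reported in comment 14; the `bisect_verdict` helper and its outcome words (`works`, `hangs`, `broken`) are hypothetical names introduced only for this example.

```shell
# Typical session, run inside the Mesa git checkout. The commands are shown as
# comments so that sourcing this snippet has no side effects:
#   git bisect start
#   git bisect bad  bd56de88dfb192310f3432a3c0e0ddc3469c6d55   # hangs (comment 14)
#   git bisect good 85a3057f651a1c56348f1af18343d9cc0a5c93f3   # works (comment 14)
#   ... build and install the checked-out commit, run the reproducer ...
#   git bisect "$(bisect_verdict broken)"   # repeat with each observed outcome
#   git bisect reset                        # when finished

# Map the outcome of one manual test to the matching git bisect subcommand.
bisect_verdict() {
    case "$1" in
        works)  echo good ;;   # reproducer runs, no hang
        hangs)  echo bad ;;    # reproducer hangs the GPU
        broken) echo skip ;;   # commit does not build or segfaults at startup
        *) echo "unknown outcome: $1" >&2; return 2 ;;
    esac
}
```

Skipping the untestable commits keeps the search going, at the cost that the final answer may be a small range of candidates rather than a single commit.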
Comment 16 Vitaly Ostrosablin 2017-01-12 16:56:25 UTC
Have successfully updated to latest mesa. Seems like issue was fixed recently.
Comment 17 Timothy Arceri 2018-04-04 10:54:52 UTC
(In reply to Vitaly Ostrosablin from comment #16)
> Have successfully updated to latest mesa. Seems like issue was fixed
> recently.
